Archive for the ‘data journalism’ tag
The term ‘data journalism’ is misleading. It gives the impression of journalists as statisticians, crouched over computer databases doing SQL queries and writing code. This may be one aspect of data journalism but only a tiny one. And it is certainly not why data journalism is the future of news.
Data journalism is shorthand for being able to cope with information abundance. It is not just about numbers. Neither is it about being a mathmo or a techie. It means being able to combine three things: individual human intelligence, networked intelligence, and computing power.
We live in an age of information abundance – both static and dynamic. By static data I mean things like annual reports, historical crime data, censuses (censi?). This is information that is collected – often by public bodies – categorized, and published. By dynamic data I mean real time information, flowing in through micro-blogs, social networks, live cameras.
Static data, which used to lie relatively dormant in archives and libraries, is increasingly being made public (on places like data.gov.uk and data.gov). On data.gov.uk there are already 5,600 data sets. In January most of the UK’s local councils (293 out of 326 at the last count) published all their spending records over £500.
Dynamic data comes at us in a torrent. 25 billion tweets were sent in 2010. 100 million new Twitter accounts were created in 2010. 35 hours of video are uploaded to YouTube every minute. There were 600 million people on Facebook by the end of 2010 (data from royal.pingdom). If you want, you can watch live CCTV cameras on the streets of London.
Data journalism is about coping with both of these. It’s about:
- being able to work out what is happening in Tahrir Square in real time from tweets, video footage, and social networks – while at the same time contextualising that with diplomatic news from Cairo and Washington (see services like Storyful and Sulia)
- being able to upload, add metadata and analyse thousands of pages of legal documents (e.g. via Document Cloud)
- being able to map crime data (e.g. see Oakland Crimespotters)
- being able to harness the intelligence of the ‘crowd’ to unearth stories from mountains of detailed data, as the Guardian did with MPs’ expenses, getting 170,000 read and checked in just over three days (and, separately, to identify all the Doctor Who baddies)
- knowing how to use metadata – in publishing, searching and using information (heard of hNews? Or RDFa? Or Open Calais?)
- building the tools that enable people to see the relevance of public information to them (as the New York Times did with its series on toxic waters)
A data journalist should have the news sense of a traditional journalist, a broad and deep social media presence, and be tech-savvy enough to do pivot tables in Excel and use tools like Google Refine. The ability to code and do database queries would be an added bonus, but is not a prerequisite.
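For anyone who has not met a pivot table, the idea is simply to group records by two fields and total a third. Here is a minimal sketch in Python of what Excel does behind the scenes, assuming made-up spending records: the council names, departments and amounts are hypothetical, and so is the `pivot_sum` helper.

```python
from collections import defaultdict

# Hypothetical sample records; not real council spending data.
RECORDS = [
    {"council": "Council A", "department": "Highways", "amount": 1200.0},
    {"council": "Council A", "department": "Libraries", "amount": 650.0},
    {"council": "Council B", "department": "Highways", "amount": 900.0},
    {"council": "Council A", "department": "Highways", "amount": 800.0},
    {"council": "Council B", "department": "Libraries", "amount": 510.0},
]


def pivot_sum(rows, row_key, col_key, value_key):
    """Sum value_key for every (row_key, col_key) pair, like a basic pivot table."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[row_key]][row[col_key]] += row[value_key]
    # Convert to plain dicts so the result prints and compares cleanly.
    return {r: dict(cols) for r, cols in table.items()}


totals = pivot_sum(RECORDS, "council", "department", "amount")
for council in sorted(totals):
    for department in sorted(totals[council]):
        print(f"{council:10} {department:10} {totals[council][department]:>9.2f}")
```

The point is that the skill is modest: a spreadsheet does this in a few clicks, and it is often enough to spot which council or department deserves a closer look.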
This is where Stieg Larsson and his ‘The Girl with/who..’ series comes in. Larsson got data journalism. He understood how rare it is for a journalist with news sense and story-telling skills to be a tech wiz as well. So he didn’t try to combine everything in one character. He created two: Mikael Blomkvist and Lisbeth Salander. Together they can source the data, analyse the data, and tell the story.
Compare Salander with WikiLeaks. WikiLeaks spent four years publishing leaked data without much public profile. Then it started to turn its data into stories (the edited footage of the US Apache helicopter attack in Iraq) and to partner with existing news organisations and journalists (particularly Nick Davies and David Leigh at the Guardian), and it became one of the best-known organisations on the planet (the leak of the Afghan warlogs, the Iraq warlogs and the diplomatic cables helped, of course).
Data needs journalism. This is where the rather misleading phrase ‘data journalism’ is also quite helpful. There is a myth that all we need to do to make the world a better place is to make everything open and transparent. Openness will help, but it only gets us halfway there. Without people and organisations able and willing to take the open data, clean it, structure it, add metadata to it, create tools to analyse it, analyse it, and tell stories from it, then the data might as well go back in the archive.
Start with Jonathan Stray’s excellent reading list on his blog
The Guardian’s Datablog is one of the pioneers in this area. Particularly notable are the way it dealt with MPs’ expenses, how it maps things like swine flu, and how it handled the WikiLeaks warlogs data
Propublica has published a series of guides on collecting data
Conrad Quilty-Harper has a good rundown of open data and its uses (good and bad) at The Telegraph
See panel discussion about data journalism at The Book Club, Shoreditch, with Mark Stephens and Ben Leapman, 9-2-11
This post was first published at PBS Mediashift Ideas Lab on Monday 2nd August, 2010.
Soon every news organization will have its own “bunker” — a darkened room where a hand-picked group of reporters hole up with a disk/memory stick/laptop of freshly opened data, some stale pizza and lots of coffee.
Last year the U.K.’s Daily Telegraph secreted half a dozen reporters in a room for nine days with about 4 million records of politicians’ expenses. They were hidden away even from the paper’s own employees. Now we learn that reporters from the Guardian, the New York Times and Der Spiegel did the same with Julian Assange of WikiLeaks somewhere in the Guardian’s offices in King’s Cross, London.
There is a wonderful irony that open data can generate such secrecy. Of course the purpose of this secrecy is to find — and protect — scoops buried in the data. From the perspective of many news organizations, these scoops are the main benefit of data dumps. Certainly the Daily Telegraph benefitted hugely from the scoops it dug out of the MPs’ expenses data. Weeks of front pages on the print paper, national uproar, multiple resignations, court cases and much soul searching about the state of parliamentary politics.
The Guardian, the New York Times and Der Spiegel have not been able to stretch the WikiLeaks Afghan logs over multiple weeks, but they did dominate the news for a while, and stories will almost certainly continue to emerge.
These massive data releases are not going to go away. In fact, they’re likely to accelerate. The U.S. and U.K. governments are currently competing to see who can release more data sets. WikiLeaks will no doubt distribute more raw information, and WikiLeaks will spawn similar stateless news organizations. Therefore news organizations need to work out how best to deal with them, both to maximize the benefits to them and their readers, and to ensure they don’t do evil, as Google might say.
Here are just five (of many) questions news orgs should ask themselves when they get their next data dump:
1. How do we harness public intelligence to generate a long tail of stories? Though the Telegraph succeeded in unearthing dozens of stories from the Parliamentary expenses data, the handful of reporters in the bunker could never trawl through each of the millions of receipts contained on the computer disks. It was the Guardian that first worked out how to deal with this: it not only made the receipts available online but provided tools to search through them and tag them (see Investigate your MP’s expenses). This way it could harness the shared intelligence, and curiosity, of hundreds, if not thousands, more volunteer watchdogs, each of whom might be looking for a different story from the expenses data. As a result, the Guardian generated many more stories and helped nurture a community of citizen scrutineers.
2. How do we make it personal? Massive quantities of data can be structured to be made directly relevant to whoever is looking at it. With crime data you can, for example, enable people to type in their postcode and see what crimes have happened in their neighborhood (e.g. San Francisco crimespotting). For MPs’ expenses, people could look up their own MP and scour his/her receipts. The Afghan logs were different in this respect, but OWNI, Slate.fr and Le Monde Diplomatique put together an app that allows you to navigate the logs by country, by military activity, and by casualties (see here). The key is to develop a front end that allows people to make the data immediately relevant to them.
3. How can we use the data to increase trust? The expenses files, the Afghan logs, the COINS database (a massive database of U.K. government spending released last month) are all original documents that can be tagged, referenced and linked to. They enable journalists not only to refer back to the original source material, but to show an unbroken narrative flow from original source to final article. This cements the credibility of the journalism and gives the reader the opportunity to explore the context within the original source material. Plus, if published as linked data, the published article can be directly linked to the original data reference.
4. How do we best — and quickly — filter the data (and work out what, and what not, to publish)? Those that are best able to filter this data using human and machine methods are those who are most likely to benefit from it. Right now only a very small number of news organizations appear to be developing these skills, notably the Guardian, the New York Times, and the BBC. The skills, and algorithms, they develop will give them a competitive advantage when dealing with future data releases (read, for example, Simon Rogers on how the Guardian handled the 92,201 rows of data and how Alastair Dant dealt with visualizing IED events at FlowingData). These skills will also help them work out what not to publish, such as data that could put people in danger.
5. How can we ensure future whistleblowers bring their data to us? It’s impossible to predict where a whistleblower will take their information. John Wick, who brokered the MPs’ expenses disk to the Telegraph, went first to the Express, one of the U.K.’s least well resourced and least prepared national papers. But it is likely that the organizations that become known for handling big data sets will have more whistleblowers coming to them. Julian Assange went to the Guardian partly because the journalist Nick Davies sought him out in Brussels (from Clint Hendler in CJR), but Assange must also have been convinced the Guardian would be able to deal with the data.
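To make question 2 concrete, here is a rough sketch in Python of the "type in your postcode" idea: filter a data set down to the reader's own area and summarise what is there. Everything in it is an assumption for illustration. The incidents are invented, the `outward_code` and `crimes_near` helpers are hypothetical, and matching on the first part of a UK-style postcode is a crude stand-in for the proper geographic lookup a real app would use.

```python
from collections import Counter

# Hypothetical incidents; a real app would load a published crime data set.
INCIDENTS = [
    {"postcode": "E2 7DG", "category": "burglary"},
    {"postcode": "E2 8AA", "category": "vehicle crime"},
    {"postcode": "N1 9GU", "category": "burglary"},
    {"postcode": "E2 7DG", "category": "burglary"},
]


def outward_code(postcode):
    """Return the district part of a UK-style postcode, e.g. 'e2 7dg' -> 'E2'."""
    compact = postcode.replace(" ", "").upper()
    # Assumes a full postcode: the last three characters are the inward code.
    return compact[:-3]


def crimes_near(postcode, incidents):
    """Count incidents by category in the same postcode district."""
    district = outward_code(postcode)
    return Counter(
        item["category"]
        for item in incidents
        if outward_code(item["postcode"]) == district
    )


# The reader types a postcode in whatever form they like.
print(dict(crimes_near("e27dg", INCIDENTS)))
```

The filtering itself is trivial. The journalism lies in choosing which data to put behind the search box and in explaining what the counts do and do not mean.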
The influence of the war logs continues to spin across the globe, particularly following the Afghan president’s comments. But it is not the first — and certainly won’t be the last — big data dump. Better that news organizations prepare themselves now.