Archive for the ‘hNews’ Category
This post was first published at PBS MediaShift Ideas Lab on 6th May 2011
The International Press Telecommunications Council (IPTC) has just launched rNews, a consistent, machine-readable way of expressing news metadata in RDFa (a linked data language). This post explains some of the differences between rNews and hNews and why, if you publish news on the web, you ought to be using one or the other.
In a now infamous incident at Cambridge University back in October 1946, mid-way through a seminar, the philosopher Ludwig Wittgenstein is said to have threatened the philosopher Karl Popper with a red-hot poker (the exact circumstances and use of the poker are still disputed, 65 years on). The argument? Over whether there are, or are not, such things as philosophical problems. Popper said there were, Wittgenstein said there were only puzzles.
Step into the similarly rarefied world of online publishing languages and, though you might not be threatened with a red-hot poker, someone will almost certainly wave its online equivalent at you — as we found when we were developing hNews — a news microformat — with the Associated Press.
We started, back in 2008, with a problem: Very few online news stories had consistent, machine-readable information about their provenance (i.e. basic stuff like who wrote it, who published it, when it was first published, etc.). This was a problem because without this information — or metadata — it was incredibly difficult to differentiate news from other content on the web, or to figure out where news had come from.
Two Solutions to the Problem
We searched about for a solution to the problem, thanks to grants from the Knight and MacArthur Foundations, and found not one but two. The first was microformats — which are straightforward, open mark-up formats built on existing standards. The second was RDFa, a method of embedding full RDF, the linked data language of the semantic web.
We decided to use microformats, for highly pragmatic reasons. We figured that most news organizations (and journalists and bloggers) were not yet ready to make the big leap to linked data. The easier we made it to integrate consistent metadata, we thought, the more likely news organizations were to do it. Our chief concern was less exactly how people made the provenance of online news more transparent than that they did it at all.
The Associated Press came to a similar conclusion, and together we developed hNews. Our pragmatism has so far borne fruit: hNews has since been integrated into about 1,200 news sites in the U.S., which means there must now be well over a hundred million news stories on the web carrying it. And the AP has based its new news registry business and its forthcoming rights clearinghouse on hNews.
This did not stop some semantic web evangelists from waving their metaphorical red-hot pokers, or from questioning our legitimacy, among other less warm and fuzzy responses.
So, when we learned that the IPTC were launching an equivalent of hNews in RDFa we were over the moon. Hooray! Now people have a choice to mark up their news in microformats or in linked data.
The Ambitious rNews
“Equivalent” is not quite right. rNews is more ambitious than hNews. If hNews is like a ham sandwich then rNews is like a baked Alaska. rNews covers lots of aspects of provenance and content. You can, if you want to mark up additional aspects of news stories, mix-and-match rNews with other RDF ontologies (i.e. different linked data vocabularies). It’s also more “correct” than hNews, but as a result more verbose and intrusive. It’s a much bigger change to existing HTML pages than hNews. That said, it is, by RDF standards, pretty straightforward. All this makes it a very good alternative way of creating consistent, machine-readable mark-up for news.
The big difference between the two is in their complexity. Making a ham sandwich is much simpler and requires less expertise than cooking a baked Alaska. The same goes for hNews and rNews. As a result, my prediction is that rNews will be the format of choice for big news organizations that want to do things fully and properly and are willing to commit the time and resources (like the New York Times, which was central to the development of rNews). In the same way it will probably suit high-end proprietary content management systems. For smaller news organizations, journalists and bloggers, hNews goes a good part of the way there and is much easier to integrate and lighter to use.
In other words, the two complement each other rather well, and ought to provide the foundations for consistent, machine-readable metadata for news.
Pros and Cons of Each Approach
The AP’s Stuart Myles was one of the creators of hNews and worked with the IPTC on rNews.
“The fact that hNews and rNews have similar names is no coincidence,” Myles told me via email. “To me, microformats and RDFa are two different technical approaches to the same challenge. Each approach has pros and cons and many tools that support one also work with the other.”
Evan Sandhaus of the New York Times, one of the original authors of rNews, also emphasizes the compatibility of the two standards: “rNews was designed from the start to provide publishers with many of the same features offered by hNews. And future versions of rNews will likely bring the standards into even closer alignment,” he told me via email.
Should you care about hNews and rNews? If you publish news on the web then you most certainly should. The arrival of rNews and the continuing take-up of hNews show that metadata is central to the future of digital news. Consistent, machine-readable metadata makes your news easier to find, more distinguishable, more straightforward to check, more programmable, more targetable, and easier to track. If you are not yet publishing your news with metadata then don’t be surprised if someone soon comes at you flailing a red-hot poker.
Evan Sandhaus (one of the original authors of rNews) has a good presentation, “All about rNews”
Stuart Myles (AP) gives “7 uses for rNews”
Last August I wrote a piece on PBS MediaShift about hNews, that is also applicable to rNews, “How Metadata Can Eliminate the Need for Pay Walls”
For those of a philosophical bent I recommend “Wittgenstein’s Poker: The story of a 10-minute argument between two great philosophers”, by David Edmonds and John Eidinow (Faber and Faber, 2001)
This is an edited version of a talk I gave on a panel about linked data and the semantic web at News: Rewired, on Thursday 16th December. The presentation slides can be seen here on slideshare.
Disclaimer: I’m not a technologist. I’m not a programmer. If you’re a geek then this piece isn’t meant for you. It’s for those of us trying to get to grips with the potential of technology and the web for news, politics, business and society, but without too much technical know-how.
What is linked data?
In the 18th century Voltaire wrote that the Holy Roman Empire was neither Holy, nor Roman, nor an Empire. You could say something similar about ‘linked data’. Linked data is neither ‘linked’ – in the way we think of hyper-linking on the web; nor is it ‘data’ – in the sense of numbers or databases. So what is it?
Data as ‘things’
The data part of linked data is really discrete ‘things’. Identifiable things like people, places, organisations, events. You are a discrete thing. I am a discrete thing. In real life there is one you, one me, one capital city called London. On the web there are likely to be many you’s – your Facebook profile, your LinkedIn profile, your flickr pages, pictures of you on other people’s pages, your blog, other blogs about you – you get the picture.
Trouble is, if you’re spread across different places on the web, how does the web know it’s you? I’m certainly not the only Martin Moore. There is Daniel Martin Moore, the singer songwriter from Kentucky (who has a new album out). There is Martin Moore the under-20 Ireland front-row rugby player (whose career I’m now following with interest). There is Martin Moore the cellar master from South Africa (I’m jealous of his job). There is Martin Moore QC. There is Martin Moore kitchens…
But how does the web know this? How does the web (and therefore people searching for me, or trying to recommend things to me) know who I am?
Well, if stuff about me is put on the web as linked data then I am given a unique identifier. A sort of human ISBN. A web snowflake. So that, whenever I publish something, or someone publishes something about me, then the web knows it’s me and not one of the many other Martin Moores out there.
Linking as grammar
Now we move onto the ‘linked’ bit of linked data. A hyperlink is a dumb link – in the sense that it just says ‘click on me and I’ll take you to another web page’. It doesn’t know why these pages are linked together, or what the relationship between them is; you have to work that out from the context.
If you publish it as linked data then you explain the relationship between the two things you’re linking. This person wrote this article. This organisation launched this product. This event happened at this time. It’s a bit like grammar, where you have subject – verb – object, i.e. John kissed Mary. (In linked data language this is called a ‘triple’ though being non-techie I prefer to think of it in grammatical terms.)
Suddenly, instead of having an indistinguishable soup of stuff on the web, you have lots and lots of distinct entities with clearly defined relationships.
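Being non-techie myself, I find a toy sketch helps. A triple really is just subject-verb-object, and once you have a pile of them you can ask questions. The snippet below is a minimal sketch in Python with made-up facts; real linked data uses RDF and web identifiers rather than plain names.

```python
# Each fact is a 'triple': (subject, predicate, object).
# These facts are invented for illustration.
triples = [
    ("John", "kissed", "Mary"),
    ("Martin Moore", "wrote", "this article"),
    ("The BBC", "published", "the lion page"),
]

def objects(subject, predicate):
    """Return everything a subject relates to via a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("John", "kissed"))  # ['Mary']
```

The point is that the relationship itself (“kissed”, “wrote”, “published”) is part of the data, which is exactly what a dumb hyperlink cannot express.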
Good reasons for publishing in linked data
So what? I hear you say. Why should I care about this in my day job? Well, there are a bunch of reasons why this could be a big deal. Here are just a few:
Publish in linked data and you can make your site much richer – both in terms of links and, potentially, in terms of automatically generated content. The BBC’s natural history pages are filled with interesting stuff about animals – including video clips, information about distribution, habitats, behaviours (e.g. see this one on the lion – complete with great sound clip of a lion growling and snarling). But only some of this content is produced by the BBC (mostly the video). Lots of the other information is automatically sourced from elsewhere – sites like WWF and Wikipedia. By combining it all together the BBC has pages that are far deeper and more threaded into the web.
This can have a great knock-on effect on where your page/site comes in search engine results. The BBC’s natural history pages, for example, which used to come somewhere way down the rankings, now appear in the top 10 results on Google (when I typed ‘lion’ into google.co.uk earlier this week, the BBC page came fourth, while aardvark came third).
Linked data can also help with sourcing. Now that lots of primary data sources are being published as linked data (e.g. on data.gov.uk) you can link directly back to the raw figures that you’re writing about. So if you write a piece about the rise in car thefts in south Wales, people can follow a line straight from your piece to the Home Office data on which it was based.
It can improve accreditation. By providing clear, consistent and unambiguous information about who wrote something, who published it, when it was published, where it was written and so on, the producer gets better credit, and the person reading has the tools to judge its credibility.
It can make searching really smart. Let’s say you wanted to find all the composers who worked in Vienna between 1800 and 1875. Right now that’s pretty tricky – or at least it might take a bit of digging to work out. But if the information were published as linked data you could just search for exactly that, because the web itself becomes a sort of distributed database.
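To make that concrete, here is a toy sketch of the Vienna composers query in Python. The list of composers and the dates are illustrative only (and deliberately tiny); a real query would run over RDF data spread across many sites rather than a local list.

```python
# Invented structured facts: (name, city, year arrived, year left).
composers = [
    ("Beethoven", "Vienna", 1792, 1827),
    ("Schubert", "Vienna", 1797, 1828),
    ("Brahms", "Hamburg", 1833, 1862),
]

def worked_in(city, start, end):
    """Names of composers whose time in `city` overlaps [start, end]."""
    return [name for name, c, arrived, left in composers
            if c == city and arrived <= end and left >= start]

print(worked_in("Vienna", 1800, 1875))  # ['Beethoven', 'Schubert']
```

Once facts are structured like this, the query is one line; with plain web pages it is hours of digging.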
Finally, but perhaps most importantly, linked data can create an environment that enables innovation and the creation of new services. Suddenly it becomes possible to build really smart stuff based on the way in which things are linked together. The BBC’s World Cup site did just this in the summer of 2010, publishing huge amounts of information – more than any team of journalists could put together themselves – sourced from lots of different places. The New York Times now publishes in linked data and encourages people to build new stuff that leverages it. There is a tutorial for building a web app to show NYT coverage of a school’s alumni, for example – see a finished app here.
Other companies are starting to use linked data and other semantic information to build recommendation engines (like GetGlue). People can start adding value to data that you would never have thought of.
A final warning
The basic premise of linked data is wonderfully simple. You link discrete things together in such a way that we know the relationship between them (subject-verb-object). Once linked, the web then starts to have an artificial intelligence of its own.
But putting this basic premise into action is more complicated. Publishing in linked data for the first time is not for the faint-hearted (we now publish journalisted.com in linked data, so we learnt for ourselves how complex it can be). You can find yourself quite quickly mired in the intricacies of linked data formats, vocabularies and many acronyms.
There are, though, ways to move towards linked data without plunging in head first. Just publishing structured metadata is a very good start (for which there are various plugins for open-source CMSs like WordPress). Microformats are also a much easier entry point for those wanting to introduce some metadata to what they publish (e.g. hNews for news).
Linked data is remarkable. It’s also a little scary. But the sooner people understand its potential and start making their information more ‘semantic’, the healthier and more navigable the web will be.
Far be it from me to question the brilliance of Google, but in the case of its new news meta tagging scheme, I’m struggling to work out why it is brilliant or how it will be successful.
First, we should applaud the sentiment. Most of us would agree that it is A Good Thing that we should be able to distinguish between syndicated and non-syndicated content, and that we should be able to link back to original sources. So it is important to recognize that both of these are – in theory – important steps forward, for news organizations and the public alike.
But there are a number of problems with the meta tag scheme that Google proposes.
Problems With Google’s Approach
Meta tags are clunky and likely to be gamed. They are clunky because they cover the whole page, not just the article. As such, if the page contains more than one article or, more likely, contains lots of other content besides the article (e.g. links, promos, ads), the meta tag will not distinguish between them. More importantly, meta tags are, traditionally, what many people have used to game the web. Put in lots of meta tags about your content, the theory goes, and you will get bumped up the search engine results. Rather than address this problem, the new Google system is likely to make it worse, since adding the “original source” meta tag will be assumed to carry material value.
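The page-level problem is easy to see in a small sketch. The page below is invented (as is the URL), and the tag name follows Google’s announced “original-source”; a few lines of Python show that a single meta tag in the head has to speak for every article on the page.

```python
from html.parser import HTMLParser

# An invented page with ONE original-source claim but TWO articles.
PAGE = """
<html><head>
<meta name="original-source" content="http://example.com/scoop">
</head><body>
<div class="article">Article one...</div>
<div class="article">Article two...</div>
</body></html>
"""

class MetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sources = []   # original-source claims found in the head
        self.articles = 0   # articles found in the body
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") == "original-source":
            self.sources.append(a.get("content"))
        if tag == "div" and a.get("class") == "article":
            self.articles += 1

finder = MetaFinder()
finder.feed(PAGE)
# One source claim, two articles: the tag cannot say which one it credits.
print(finder.sources, finder.articles)
```

Article-level markup like hNews avoids this by attaching the metadata to the article element itself rather than to the page.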
Though there is a clear value in being able to identify sources, distinguishing between an “original source” as opposed to a source is fraught with complications. This is something that those of us working on hNews, a microformat for news, have found when talking with news organizations. For example, if a journalist attends a press conference then writes up that press conference, is that the original source? Or is it the press release from the conference with a transcript of what was said? Or is it the report written by another journalist in the room published the following day etc.? Google appears to suggest they could all be “original sources”, but if this extends too far then it is hard to see what use it is.
Even when there is an obvious original source, like a scientific paper, news organizations rarely link back to it (even though it’s easy using a hyperlink). The BBC – which is generally more willing to source than most – has, historically, tended to link to the front page of a scientific publication or website rather than to the scientific paper itself (something the Corporation has sought to address in its more recent editorial guidelines). It is not even clear, in the Google meta-tagging scheme, whether a scientific paper is an original source, or the news article based on it is an original source.
And what about original additions to existing news stories? As Tom Krazit wrote on CNET news,
… the notion of “original source” doesn’t take into account incremental advances in news reporting, such as when one publication advances a story originally broken by another publication with new important details. In other words, if one publication broke the news of Prince William’s engagement while another (hypothetically) later revealed exactly how he proposed, who is the “original source” for stories related to “Prince William engagement,” a hot search term on Google today?
Something else Google’s scheme does not acknowledge is that there are already methodologies out there that do much of what it is proposing, and that are in widespread use (ironic given Google’s launch title “Credit where credit is due”). For example, our News Challenge-funded project, hNews, addresses the question of syndicated versus non-syndicated content, and in a much simpler and more effective way. Google’s meta tags do not clash with hNews (both conventions can be used together), but neither do they build on its elements or work in concert with them.
One of the key elements of hNews is “source-org”, the source organization from which the article came. Not only does this go part-way towards the second tag Google suggests, “original source”, it also cleverly avoids the difficult question of how to credit a news article that may be based on wire copy but has been adapted since – a frequent occurrence in journalism. The Google syndication method does not capture this important difference. hNews is also already the standard used by the U.S.’s biggest syndicator of content, the Associated Press, and is used by more than 500 professional U.S. news organizations.
It’s also not clear if Google has thought about how this will fit into the workflow of journalists. Every journalist we spoke to when developing hNews said they did not want to have to do things that would add time and effort to what they already do to gather, write up, edit and publish a story. It was partly for this reason that hNews was made easy to integrate into publishing systems; it’s also why hNews marks information up automatically.
Finally, the new Google tags only give certain aspects of credit. They give credit to the news agency and the original source but not to the author, or to when the piece was first published, or how it was changed and updated. As such, they are a poor cousin to methodologies like hNews and linked data/RDFa.
Ways to Improve
In theory Google’s initiative could be, as this post started by saying, a good thing. But there are a number of things Google should do if it is serious about encouraging better sourcing and wants to create a system that works and is sustainable. It should:
- Work out how to link its scheme to existing methodologies, not just hNews but linked data and other meta tagging methods
- Start a dialogue with news organizations about sourcing information in a more consistent and helpful way
- Clarify what it means by original source and how it will deal with different types of sources
- Explain how it will prevent its meta tagging system from being misused such that the term “original source” fast becomes useless
- Use its enormous power to encourage news organizations to include sources, authors, etc. by ranking properly marked-up news items over plain-text ones
It is not clear whether the Google scheme – as currently designed – is more focused on helping Google with some of its own problems sorting news or with nurturing a broader ecology of good practice.
One cheer for intention, none yet for collaboration or execution.
This article was first posted at PBS MediaShift Ideas Lab on Thursday 18th November.
The San Francisco Chronicle was founded in 1865. It is the only daily broadsheet newspaper in San Francisco – and is published online at SFgate.com. In the 1960s Paul Avery was a police reporter at the Chronicle when he started investigating the so-called ‘Zodiac Killer’. And, earlier this year Mark Fiore won a Pulitzer Prize for his animated online cartoons for the paper (well worth watching his cartoon with Snuggly the security bear demonstrating how to make the internet ‘wire tap friendly’).
The Chronicle is also one of 577 US news sites now publishing articles with hNews (full list here).
hNews is the news microformat we developed with the Associated Press that makes the provenance of news articles clear, consistent and machine readable. A news article with hNews will – by definition – identify its author, its source organisation, its title, when it was published and – in most cases – the license associated with its use and a link to the principles to which it adheres (e.g. see AP essential news). It could also have where it was written, when it was updated, and a bunch of other useful stuff.
Essentially, hNews makes the provenance of a news article a lot more transparent – which is good news for whoever produces the article (gains credit, creates potential revenue models etc.), and good news for the end user (better able to assess its provenance, greater credibility etc.).
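For the curious, here is a rough sketch of what “machine readable” buys you. The article below is invented, but the class names are the ones hNews uses (entry-title, author/fn, source-org, published); a few lines of Python pull the provenance fields straight out of the markup.

```python
from html.parser import HTMLParser

# An invented article marked up with hNews class names.
ARTICLE = """
<div class="hnews hentry">
  <h1 class="entry-title">Lion spotted in Vienna</h1>
  <span class="author vcard"><span class="fn">Martin Moore</span></span>
  <span class="source-org vcard"><span class="org">Example Gazette</span></span>
  <abbr class="published" title="2010-10-12">12 October 2010</abbr>
</div>
"""

class HNewsReader(HTMLParser):
    # Classes whose text content we want to capture.
    FIELDS = {"entry-title", "fn", "org"}
    def __init__(self):
        super().__init__()
        self.meta = {}
        self._current = None
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = set((a.get("class") or "").split())
        if "published" in classes:          # date lives in the title attribute
            self.meta["published"] = a.get("title")
        hit = classes & self.FIELDS
        if hit:
            self._current = hit.pop()       # capture this element's text next
    def handle_data(self, data):
        if self._current:
            self.meta[self._current] = data.strip()
            self._current = None

reader = HNewsReader()
reader.feed(ARTICLE)
print(reader.meta)
```

Because the fields are marked up consistently, any programme – a search engine, a rights registry, an aggregator – can read the article’s provenance without guessing.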
Up to now, though we have been aware that many sites have been integrating hNews, there has not been a published list of these sites. This seemed to us a little unsatisfactory. So we went out and found as many of them as we could and have now published them on a list as an open Google doc.
There are, I understand, a few hundred more sites that have either already integrated hNews or are in the process of integrating it. We haven’t found them yet but will add them when we do. If you know of one (or if you are one) please let us know and we’ll add it.
If you’re interested in integrating hNews and are wondering why you would, you can read a piece I wrote for PBS MediaShift (‘How metadata can eliminate the need for paywalls’), see the official specification at hNews microformats wiki, watch an hNews presentation by Stuart Myles, view a (slightly dated) slideshow on why it creates ‘Value Added News’, or see how to add hNews to WordPress.
hNews was developed as part of the transparency initiative of the Media Standards Trust, which aims to make news on the web more transparent. The initiative has been funded by the MacArthur Foundation and the Knight Foundation. You can read more about the transparency initiative elsewhere on this site.
This post was first published on the Media Standards Trust site on Tuesday 12th October, 2010
Update: I’m grateful to Max Cutler for spotting a number of duplicate entries in the original list, which have now been cleaned up. It’s still 577 sites, since in the process of cleaning we found a few more. And, as I wrote in my original post, this number is by no means final. There are almost certainly a lot more sites publishing with hNews; it’s just a matter of finding them (through sweat and scrapers). So if you spot any that aren’t on the list, please let me know.