Archive for the ‘hNews’ tag
This open letter to Google, Bing and Yahoo!, following the launch of schema.org, was first published at www.mediastandardstrust.org on 7th June, 2011.
Let me first say what good news it is to learn about the launch of schema.org. Consistent, structured metadata is a very good thing. Structured metadata will not only help search, it should provide a more solid foundation for the future of the web.
We also have a request: that you seriously consider integrating principles into schema.org/NewsArticle (expressed as rel-principles in hNews). This should be to the benefit of individuals and organisations producing news, and to the benefit of the public. Below I explain why.
We have been developing and evangelising about consistent metadata in news for over three years. During that time we successfully developed hNews – a microformat for news – with the Associated Press, thanks to support from a Knight News Challenge award and a MacArthur Foundation grant. hNews has now been integrated into over 1,200 news sites across the US.
hNews will continue to be relevant and useful to all those publishing in HTML4 since it is light, simple, and easy to integrate. For individuals and organisations who want the benefits of consistent metadata without having to make a major investment, hNews will be the most sensible approach.
But, as news organisations move to HTML5 it will make sense for them to adopt HTML5 standards. Microdata schema is one of these. This is a natural and positive development, and schema.org/NewsArticle contains many of the same values as hNews.
There is, however, an important property missing – principles. This property would provide a link to the statement of principles, if any, to which an article adheres. It does not define what those principles ought to be, or what they should or should not include, it just links to them. In hNews this is expressed as rel-principles. You can see an example of an embedded link to principles at the AP’s essential news – click on the blue ‘P’ at the top (e.g. http://apne.ws/imN772).
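To make the idea concrete, here is a minimal sketch of how a machine might pick up a link to principles from a page. The article fragment and its URL are invented for illustration, and the parser is only one simple way of reading the rel attribute, not a description of how any search engine actually does it.

```python
from html.parser import HTMLParser


class PrinciplesLinkFinder(HTMLParser):
    """Collect the href of every <a> or <link> whose rel includes 'principles'."""

    def __init__(self):
        super().__init__()
        self.principles_links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel is a space-separated list of tokens, so split before comparing.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "principles" in rel_tokens and attr_map.get("href"):
            self.principles_links.append(attr_map["href"])


def find_principles_links(html):
    """Return every principles link found in the given HTML string."""
    finder = PrinciplesLinkFinder()
    finder.feed(html)
    return finder.principles_links


# Hypothetical article fragment; the headline and URL are placeholders.
SAMPLE_ARTICLE = """
<div class="hnews hentry">
  <h1 class="entry-title">Council approves new budget</h1>
  <a rel="principles" href="https://news.example.org/principles">Our principles</a>
</div>
"""

print(find_principles_links(SAMPLE_ARTICLE))
```

Note that the code only finds the link. It makes no judgement about what the principles say, which is exactly the point of the property.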
There are three reasons why a machine readable link to principles is so important and in the interests of schema.org:
- It tells people it’s news: we all used to know what people meant when they talked about ‘news’. It was that thing which was produced by journalists and published by news organisations. That is no longer the case. News can be produced and published by anyone and sits within a huge ecology of other media content. There is no easy way to tell if something is meant to be ‘news’ unless someone describes it as such
- It distinguishes news from other web content: link to news principles and suddenly people – and search engines – can distinguish news from other content on the web – particularly from personal, government or commercial content. This is not only helpful for search but has a social value too
- It explains what news is: news is generally informed by certain values, even if these are sometimes subconscious. We used to take these for granted when news was printed or broadcast. We can’t now. We need to know where news comes from and what has informed its production. Basic information like who wrote it and when it was written gets us part way there. But to get any further we need to have access to the principles – if any – to which the news adheres.
Of course it is already perfectly possible for people to link to principles of their own accord using microdata or microformats. But, as we have learnt over the last three years, organisations are unlikely to add metadata unless they can see a direct advantage. If principles were within the core schema then it would significantly increase the likelihood that news organisations would add it.
Adding principles to schema.org/NewsArticle would benefit search and assessment. It would also help the future of news.
That is why we think it would make sense to integrate principles to schema.org. We would, of course, be delighted to talk more about it. Please do get in touch – all contact details at www.mediastandardstrust.org.
Far be it from me to question the brilliance of Google, but in the case of its new news meta tagging scheme, I’m struggling to work out why it is brilliant or how it will be successful.
First, we should applaud the sentiment. Most of us would agree that it is A Good Thing that we should be able to distinguish between syndicated and non-syndicated content, and that we should be able to link back to original sources. So it is important to recognize that both of these are – in theory – important steps forward both from the perspective of news and the public.
But there are a number of problems with the meta tag scheme that Google proposes.
Problems With Google’s Approach
Meta tags are clunky and likely to be gamed. They are clunky because they cover the whole page, not just the article. As such, if the page contains more than one article or, more likely, contains lots of other content besides the article (e.g. links, promos, ads), the meta tag will not distinguish between them. More important is that meta tags are, traditionally, what many people have used to game the web. Put in lots of meta tags about your content, the theory goes, and you will get bumped up the search engine results. Rather than address this problem, the new Google system is likely to make it worse, since there will be assumed to be a material value to adding the “original source” meta tag.
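A small sketch shows why a page-level tag is a blunt instrument. The tag name original-source is used here as I understand it from Google's announcement and should be treated as an assumption, as should the sample page and its URL, which are invented. The class name hentry is simply used to count the articles on the page.

```python
from html.parser import HTMLParser


class PageScan(HTMLParser):
    """Record page-level <meta> values and count hentry-marked articles."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.article_count = 0

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("name"):
            # A later tag with the same name overwrites the earlier one here;
            # either way the value describes the page, not one article.
            self.meta[attr_map["name"]] = attr_map.get("content")
        if "hentry" in (attr_map.get("class") or "").split():
            self.article_count += 1


# Hypothetical page: one meta tag in the head, two articles in the body.
SAMPLE_PAGE = """
<html><head>
<meta name="original-source" content="https://wire.example.org/story-1">
</head><body>
<div class="hnews hentry"><h2 class="entry-title">Story one</h2></div>
<div class="hnews hentry"><h2 class="entry-title">Story two</h2></div>
</body></html>
"""

scan = PageScan()
scan.feed(SAMPLE_PAGE)
print(scan.meta, scan.article_count)
```

The scan finds two articles but only one source value, and nothing in the markup says which article that value belongs to.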
Though there is a clear value in being able to identify sources, distinguishing between an “original source” as opposed to a source is fraught with complications. This is something that those of us working on hNews, a microformat for news, have found when talking with news organizations. For example, if a journalist attends a press conference then writes up that press conference, is that the original source? Or is it the press release from the conference with a transcript of what was said? Or is it the report written by another journalist in the room published the following day etc.? Google appears to suggest they could all be “original sources”, but if this extends too far then it is hard to see what use it is.
Even when there is an obvious original source, like a scientific paper, news organizations rarely link back to it (even though it’s easy using a hyperlink). The BBC – which is generally more willing to source than most – has, historically, tended to link to the front page of a scientific publication or website rather than to the scientific paper itself (something the Corporation has sought to address in its more recent editorial guidelines). It is not even clear, in the Google meta-tagging scheme, whether a scientific paper is an original source, or the news article based on it is an original source.
And what about original additions to existing news stories? As Tom Krazit wrote on CNET news,
… the notion of “original source” doesn’t take into account incremental advances in news reporting, such as when one publication advances a story originally broken by another publication with new important details. In other words, if one publication broke the news of Prince William’s engagement while another (hypothetically) later revealed exactly how he proposed, who is the “original source” for stories related to “Prince William engagement,” a hot search term on Google today?
Something else Google’s scheme does not acknowledge is that there are already methodologies out there that do much of what it is proposing, and are in widespread use (ironic given Google’s launch title “Credit where credit is due”). For example, our News Challenge-funded project, hNews, addresses the question of syndicated/non-syndicated, and in a much simpler and more effective way. Google’s meta tags do not clash with hNews (both conventions can be used together), but neither do they build on its elements or work in concert with them.
One of the key elements of hNews is “source-org” or the source organization from which the article came. Not only does this go part-way towards the “original source” second tag Google suggests, it also cleverly avoids the difficult question of how to credit a news article that may be based on wire copy but has been adapted since – a frequent occurrence in journalism. The Google syndication method does not capture this important difference. hNews is also already the standard used by the U.S.’s biggest syndicator of content, the Associated Press, and is also used by more than 500 professional U.S. news organizations.
It’s also not clear if Google has thought about how this will fit into the workflow of journalists. Every journalist we spoke to when developing hNews said they did not want to have to do things that would add time and effort to what they already do to gather, write up, edit and publish a story. It was partly for this reason that hNews was made easy to integrate to publishing systems; it’s also why hNews marks information up automatically.
Finally, the new Google tags only give certain aspects of credit. They give credit to the news agency and the original source but not to the author, or to when the piece was first published, or how it was changed and updated. As such, they are a poor cousin to methodologies like hNews and linked data/RDFa.
Ways to Improve
In theory Google’s initiative could be, as this post started by saying, a good thing. But there are a number of things Google should do if it is serious about encouraging better sourcing and wants to create a system that works and is sustainable. It should:
- Work out how to link its scheme to existing methodologies — not just hNews but linked data and other meta tagging methods.
- Start a dialogue with news organizations about sourcing information in a more consistent and helpful way
- Clarify what it means by original source and how it will deal with different types of sources
- Explain how it will prevent its meta tagging system from being misused such that the term “original source” fast becomes useless
- Use its enormous power to encourage news organizations to include sources, authors, etc. by ranking properly marked-up news items over plain-text ones
It is not clear whether the Google scheme – as currently designed – is more focused on helping Google with some of its own problems sorting news or with nurturing a broader ecology of good practice.
One cheer for intention, none yet for collaboration or execution.
This article was first posted at PBS MediaShift Ideas Lab on Thursday 18th November.
The San Francisco Chronicle was founded in 1865. It is the only daily broadsheet newspaper in San Francisco – and is published online at SFgate.com. In the 1960s Paul Avery was a police reporter at the Chronicle when he started investigating the so-called ‘Zodiac Killer’. And, earlier this year Mark Fiore won a Pulitzer Prize for his animated online cartoons for the paper (well worth watching his cartoon with Snuggly the security bear demonstrating how to make the internet ‘wire tap friendly’).
The Chronicle is also one of 577 US news sites now publishing articles with hNews (full list here).
hNews is the news microformat we developed with the Associated Press that makes the provenance of news articles clear, consistent and machine readable. A news article with hNews will – by definition – identify its author, its source organisation, its title, when it was published and – in most cases – the license associated with its use and a link to the principles to which it adheres (e.g. see AP essential news). It could also have where it was written, when it was updated, and a bunch of other useful stuff.
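As a rough illustration of what "machine readable" means here, the sketch below pulls those fields out of a marked-up article. The class and rel names (entry-title, author, source-org, dateline, published, updated, item-license, principles) are the ones I understand the hNews draft to use; the sample article, its names and its URLs are invented, and the parser is a simplified reading of the markup rather than a full microformats parser.

```python
from html.parser import HTMLParser

TEXT_FIELDS = {"entry-title", "author", "source-org", "dateline", "published", "updated"}
REL_FIELDS = {"item-license", "principles"}
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class HNewsParser(HTMLParser):
    """Collect hNews-style fields from class and rel attributes."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = []  # one entry per open tag: the field names it started

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        for rel in (attr_map.get("rel") or "").split():
            if rel in REL_FIELDS and attr_map.get("href"):
                self.fields[rel] = attr_map["href"]
        started = []
        for cls in (attr_map.get("class") or "").split():
            if cls in TEXT_FIELDS and cls not in self.fields:
                # Prefer a machine-readable title attribute (e.g. on <abbr>);
                # otherwise gather the element's text as it streams past.
                self.fields[cls] = attr_map.get("title") or ""
                if not attr_map.get("title"):
                    started.append(cls)
        if tag not in VOID_TAGS:
            self._open.append(started)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        for started in self._open:
            for field in started:
                self.fields[field] += data


def parse_hnews(html):
    """Return a dict of field name to value, with whitespace normalised."""
    parser = HNewsParser()
    parser.feed(html)
    return {key: " ".join(value.split()) for key, value in parser.fields.items()}


# Hypothetical article; every name and URL below is a placeholder.
SAMPLE = """
<div class="hnews hentry">
  <h1 class="entry-title">Council approves new budget</h1>
  <span class="author vcard"><span class="fn">Jane Doe</span></span>
  <span class="source-org vcard"><span class="org fn">Example Wire</span></span>
  <span class="dateline">Springfield</span>
  <abbr class="published" title="2010-10-12T09:00:00Z">12 October 2010</abbr>
  <a rel="item-license" href="https://news.example.org/license">License</a>
  <a rel="principles" href="https://news.example.org/principles">Principles</a>
</div>
"""

for name, value in parse_hnews(SAMPLE).items():
    print(f"{name}: {value}")
```

The reader of the page sees an ordinary article; the same markup gives a program the author, source, date, license and principles without any guesswork.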
Essentially, hNews makes the provenance of a news article a lot more transparent – which is good news for whoever produces the article (gains credit, creates potential revenue models etc.), and good news for the end user (better able to assess its provenance, greater credibility etc.).
Up to now, though we have been aware that many sites have been integrating hNews, there has not been a published list of these sites. This seemed to us a little unsatisfactory. So we went out and found as many of them as we could and have now published them on a list as an open Google doc.
There are, I understand, a few hundred more sites that have either already integrated hNews or are in the process of integrating it. We haven’t found them yet but will add them when we do. If you know of one (or if you are one) please let us know and we’ll add it.
If you’re interested in integrating hNews and are wondering why you would, you can read a piece I wrote for PBS MediaShift (‘How metadata can eliminate the need for paywalls’), see the official specification at hNews microformats wiki, watch an hNews presentation by Stuart Myles, view a (slightly dated) slideshow on why it creates ‘Value Added News’, or see how to add hNews to WordPress.
hNews was developed as part of the transparency initiative of the Media Standards Trust, which aims to make news on the web more transparent. The initiative has been funded by the MacArthur Foundation and the Knight Foundation. You can read more about the transparency initiative elsewhere on this site.
This post was first published on the Media Standards Trust site on Tuesday 12th October, 2010
Update: I’m grateful to Max Cutler for spotting a number of duplicate entries in the original list which have now been cleaned up. It’s still 577 sites since in the process of cleaning we found a few more. And, as I wrote in my original post, this number is by no means final. There are almost certainly a lot more sites publishing with hNews; it’s just a matter of finding them (through sweat and scrapers). So if you spot any that aren’t on the list, please let me know.
You have to admire his chutzpah. Rupert Murdoch, the so-called nemesis of public interest news, is now being hailed by some as its potential savior. Sick and tired of people reading his news outlets for free online, Murdoch has erected pay walls around his sites (or some of them at least).
Anyone who wants to see what is published on thetimes.co.uk will have to pay at least £1. That includes search engines who are not even allowed to index the Times’ online content. Now we have to wait and see if the subscription revenues start rolling in.
Yet even those who hope the pay wall succeeds have reservations. Pay walls represent both a practical and philosophical shift in the provision of news on the net. They represent a shift from the openness that has defined the early history of the web, to a closed world much more reminiscent of the 20th century’s constrained media environment. Erect a pay wall and you immediately cut yourself off from much of the web community. You disable the vast majority of people from recommending, linking, commenting, quoting, and discussing.
It is for this reason that any forward thinking journalist cannot help but be disheartened by the pay wall. It cuts you off from a much bigger potential audience. It suffocates networked journalism, whereby you engage with your readers to source, expand, deepen, and extend your story. It limits your opportunity to enhance your own brand, as opposed to that of the publication. But worst of all, it turns its back on the reason for the net’s success — the flowering of millions of conversations. As the lawyer who stopped writing for the Times after it put up its pay wall said, “inside the paywall no-one can even hear you scream.”
Fortunately, there is an alternative. A way in which news can remain distributed, open, even re-usable. A way in which journalism can work with the grain of the web, and continue to grow, extend, and integrate. And it is a way — crucially — that journalism can still make money.
But first, a story.
LIBRARY OF ALEXANDRIA
In the fourth century BC, a student of Aristotle, Demetrius of Phaleron, set up a library in Alexandria. It was a little different from the libraries we’re now familiar with. It had lecture halls, a dining room, meeting rooms, and a “walk.” It also had a reading room and lots of books (or scrolls, as they then were). Within a few decades it had acquired almost half a million scrolls, many containing multiple works. Such an abundance of scrolls would quickly have become unmanageable had it not been for Callimachus of Cyrene. Callimachus started “the first subject catalogue in the world, the Pinakes,” according to Roy Macleod in “The Library of Alexandria.” This was made up of six sections and catalogued some 120,000 scrolls of classical poetry and prose. His methods were then adopted and extended by other librarians.
Thanks in no small part to the cataloguing, people were able to build on each other’s knowledge. Scholars began to compare the texts and try to understand the reasons why they differed. Hence cross-textual analysis was born. People were able to contrast and evaluate various scientific methods. Archimedes (of “Eureka” fame) worked out methods for calculating areas and volumes while at the library that later formed the basis for calculus.
The library at Alexandria became the most famous of the ancient world, and spawned many further libraries and even whole university towns such as Bologna and Oxford. Yet had its books not been catalogued none of this might have happened. Had the books not had metadata giving basic details about who wrote them, when they were written, what they should be classified as, then there would not have been the foundations on which scholars could build.
Metadata is just a fancy word for information about information. A library catalogue is metadata because it categorizes the books and describes where you can find them. You find metadata on the side of every food packet, only we don’t call it metadata, we call it ingredients. The equivalent metadata about a news article would capture information about where it was written, who wrote it, when it was first published, when it was updated. All pretty basic stuff, but critical to properly identifying it and helping its distribution.
IMPORTANCE OF METADATA
Metadata did not matter so much when news was all tidily packaged together in a newspaper. You knew when something was published because it was inside that day’s paper. You knew who had published it because it was on the masthead and at the top of every page. There was — is — lots of metadata about news in newspapers, we just tend to take it all for granted.
The Internet, and the search engines and social networks that power the web, have broken the newspaper package down into discrete pieces of content. These atomized chunks — individual news articles, photographs, video clips, audio clips — are what we consume online. We do not read an online paper cover to cover, as we would a print paper. That would be exhausting. The BBC news website publishes about 150,000 words each day. To skim every individual article would take upwards of 17 hours. Instead we pick and choose, we unbundle.
Rather than seeing unbundling as a problem, news outlets should see it as an opportunity. An opportunity to distribute news all around the web. An opportunity to get readers to help sell their news – by recommending pieces to their colleagues and friends, and by linking to stories from their networks and blogs. The only thing news producers need to do before publishing a news article, is make sure it has metadata integrated to it. This way whenever people — or machines (i.e. search engines) — see it, they can also see its provenance, recognize what category of information it is, and give credit to its creator.
Having basic information about who produced something is to the mutual advantage of the person who wrote the article (or took the photograph or shot the film footage), and of the public who is reading it. The producer gets proper credit for what they created, and the public gets to see who created it — giving the news greater transparency and a measure of accountability.
When you think about it, it seems remarkable that so much content does not have this sort of metadata already. It is like houses not having house numbers or zip codes. Or like movies not having opening or closing credits. Or like a can of food without an ingredients label. As Jeff Jarvis wrote recently, “When it comes to products, we want to know: where it was made, by whom, in what conditions, using what materials, causing what damage, traveling what distance, with whose assurances of quality, with whose assurances of safety.” Why should news be any different?
hNews is just one of a number of methods of adding metadata. It is a simple, open standard that is free and that anyone can implement. We at the Media Standards Trust in Britain developed it in partnership with Sir Tim Berners-Lee’s Web Science Trust, and in the latter stages by working with the Associated Press. (This was made possible thanks to two foundation grants, one from the MacArthur Foundation and one from the Knight Foundation. You can read my blog posts about the development of hNews over at Idea Lab, a Knight-funded sister site of PBS MediaShift.)
There are other ways to add metadata to news, for example using RDF or linked data. hNews is an easy entry point since it is built on existing standards (microformats), fits easily within any CMS (there is a WordPress and a Blogger plugin), and is entirely reversible. Almost 500 news sites in the US have already implemented hNews, including the Associated Press and AOL. But you choose whichever one suits you best. (Some sample implementations are available here.)
Once hNews is added there are some immediate benefits. Every news article has consistent information about who wrote it, who published it, when it was published etc. built into it. Every article also has an embedded link to the license associated with its reuse (so ignorance is no excuse). And, every article has a link to the principles to which it adheres. These principles should not only help to distinguish the article as journalism, but should make the principles that define journalism — that are right now opaque and little understood by the public — transparent. Moreover, all this information is made ‘machine-readable’ by hNews. In other words a machine (like a search engine) can understand it.
Making this information machine-readable opens up the less immediate, but more exciting aspects of metadata. It creates an ecology of structured data that makes search more intelligent, enables innovation, and opens up new revenue opportunities.
It is a little known truth that much of the evolution of the web has already been driven by open standards. And that many of the uses of open standards are not at first apparent to those who create them. Who could have known that RSS (Really Simple Syndication), a simple standard for syndicating web content, would now be the way millions of people consume audio podcasts? Or that OAuth and OpenID would so simplify the sharing of private information across websites?
The openness and re-usability of hNews enables people to build stuff with it and on top of it. It allows you, for example, to add a “news ingredients” label to the bottom of each article. This is what Open Democracy are doing. Under each article that has hNews embedded they will automatically add an hNews icon. Scroll over this icon and you will get a pop-up box with all the basic details of the article (author, publish date/time, principles etc.). Rather like the ingredients on a food packet. Some of this information is hyperlinked so that you can click directly through to more information – like the license associated with re-use of the article. Imagine labels like these on all news articles. At a stroke you would have transformed their transparency and accountability.
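Once the fields are structured, producing a label of this kind is close to trivial. The sketch below is my own illustration, not Open Democracy's code: the field names follow hNews conventions as I understand them, and the sample values are invented.

```python
# Display order and human-readable names for the label (illustrative choice).
LABEL_ORDER = [
    ("entry-title", "Title"),
    ("author", "Author"),
    ("source-org", "Source"),
    ("published", "Published"),
    ("updated", "Updated"),
    ("item-license", "License"),
    ("principles", "Principles"),
]


def render_ingredients_label(fields):
    """Turn a dict of article metadata into a plain-text 'ingredients' label."""
    lines = ["News ingredients"]
    for key, label in LABEL_ORDER:
        value = fields.get(key)
        if value:  # skip anything the publisher did not supply
            lines.append(f"{label}: {value}")
    return "\n".join(lines)


# Hypothetical metadata for one article; all values are placeholders.
ARTICLE_FIELDS = {
    "entry-title": "Council approves new budget",
    "author": "Jane Doe",
    "source-org": "Example Wire",
    "published": "2010-10-12T09:00:00Z",
    "principles": "https://news.example.org/principles",
}

print(render_ingredients_label(ARTICLE_FIELDS))
```

Fields that are missing are simply left off the label, much as an ingredients list only names what is in the packet.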
Embedding metadata like hNews has countless other potential uses. As a simple illustration of the type of thing it enables, we built a browser plugin – itchanged.org – that allows you to track changes in news articles. Another application might be more intelligent recommendations (e.g. see readness.com). But most importantly, structuring data creates an environment in which invention becomes possible — in the same way, for example, that library catalogues do.
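To give a flavour of change tracking, here is a minimal sketch that compares two saved versions of an article line by line using Python's standard difflib module. It is not how itchanged.org works internally, which I am not describing here; the two article versions are invented.

```python
import difflib


def article_changes(earlier_text, later_text):
    """Return the lines removed ('- ') and added ('+ ') between two versions."""
    diff = difflib.ndiff(earlier_text.splitlines(), later_text.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop both.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Two hypothetical versions of the same story.
EARLIER = "The council met on Monday.\nThe vote was 5-4."
LATER = "The council met on Monday.\nThe vote was 6-3."

for change in article_changes(EARLIER, LATER):
    print(change)
```

Combined with consistent published and updated times in the metadata, a comparison like this lets a reader see not just that a story changed but what changed.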
AP NEWS REGISTRY
It can also help news organizations work out ways to make money. For example, the Associated Press has built its News Registry on top of hNews. The news registry is AP’s way of tracking its news around the web so that it has much better metrics that it can use to charge more accurately for its content, and work out revenue sharing opportunities for advertising associated with its content.
How it does this is pretty straightforward. In addition to hNews the AP embeds an image file, probably a transparent pixel, to each news article. This file is equivalent to a photograph in a web page, except that it is not intended to be seen. But like a photograph in a web page, this image file has to be served up from a separate server — in this case AP’s servers. So whenever the article is viewed on a computer, the browser (Internet Explorer, Firefox etc.) notices the image file and asks AP’s server to deliver it. That way the AP knows who is reading the article. It’s a little like a carrier pigeon. The pigeon can fly wherever it likes but always knows where its home is.
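The mechanism can be sketched in a few lines. This is a generic illustration of a tracking image, not the AP's actual implementation: the server address, the article parameter and the function names are all hypothetical. What the tracking server learns comes from the request itself, typically the page the article was embedded in (the Referer header) and the browser making the request.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical address of the tracking server.
BEACON_URL = "https://registry.example.org/pixel.gif"


def build_beacon_tag(article_id):
    """Build the invisible 1x1 image tag that would be embedded in an article."""
    query = urlencode({"article": article_id})
    return f'<img src="{BEACON_URL}?{query}" width="1" height="1" alt="">'


def record_beacon_request(request_path, headers):
    """Extract what the tracking server can learn from one image request."""
    query = parse_qs(urlparse(request_path).query)
    return {
        "article": query.get("article", [None])[0],
        "seen_on": headers.get("Referer"),  # page on which the article appeared
        "user_agent": headers.get("User-Agent"),
    }


# What the publisher embeds alongside the article.
print(build_beacon_tag("story-001"))

# What the server sees when a browser on another site loads the image.
print(record_beacon_request(
    "/pixel.gif?article=story-001",
    {"Referer": "https://news.example.com/local/story", "User-Agent": "ExampleBrowser/1.0"},
))
```

The article itself can travel anywhere; each time it is displayed, the image request reports back which article was shown and where.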
Pay walls will rise and pay walls will fall. But in the world of information abundance in which we now live pay walls are a step backwards. If news wants to benefit from the remarkable openness and dynamism that the internet has unleashed then it should embrace the distributed network and take advantage of it, not turn its back.