Archive for the ‘linked data’ tag
This is an edited version of a talk I gave on a panel about linked data and the semantic web at News: Rewired, on Thursday 16th December. The presentation slides can be seen here on slideshare.
Disclaimer: I’m not a technologist. I’m not a programmer. If you’re a geek then this piece isn’t meant for you. It’s for those of us trying to get to grips with the potential of technology and the web for news, politics, business and society, but without too much technical know-how.
What is linked data?
In the 18th century Voltaire wrote that the Holy Roman Empire was neither Holy, nor Roman, nor an Empire. You could say something similar about ‘linked data’. Linked data is neither ‘linked’ – in the way we think of hyper-linking on the web; nor is it ‘data’ – in the sense of numbers or databases. So what is it?
Data as ‘things’
The data part of linked data is really discrete ‘things’. Identifiable things like people, places, organisations, events. You are a discrete thing. I am a discrete thing. In real life there is one you, one me, one capital city called London. On the web there are likely to be many you’s – your Facebook profile, your LinkedIn profile, your flickr pages, pictures of you on other people’s pages, your blog, other blogs about you – you get the picture.
Trouble is, if you’re spread across different places on the web, how does the web know which one is you? I’m certainly not the only Martin Moore. There is Daniel Martin Moore, the singer songwriter from Kentucky (who has a new album out). There is Martin Moore the under-20 Ireland front-row rugby player (whose career I’m now following with interest). There is Martin Moore the cellar master from South Africa (I’m jealous of his job). There is Martin Moore QC. There is Martin Moore kitchens…
But how does the web know this? How does the web (and therefore people searching for me, or trying to recommend things to me) know who I am?
Well, if stuff about me is put on the web as linked data then I am given a unique identifier – in practice a web address, or URI. A sort of human ISBN. A web snowflake. So that, whenever I publish something, or someone publishes something about me, the web knows it’s me and not one of the many other Martin Moores out there.
Linking as grammar
Now we move on to the ‘linked’ bit of linked data. A hyperlink is a dumb link – in the sense that it just says ‘click on me and I’ll take you to another web page’. It doesn’t know why the two pages are linked together, or what the relationship between them is; you have to work that out from the context.
If you publish a link as linked data, you explain the relationship between the two things you’re linking. This person wrote this article. This organisation launched this product. This event happened at this time. It’s a bit like grammar, where you have subject – verb – object, i.e. John kissed Mary. (In linked data language this is called a ‘triple’, though being non-techie I prefer to think of it in grammatical terms.)
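For the technically curious, the subject – verb – object idea can be sketched in a few lines of code. This is a toy illustration with made-up statements, not a real linked data library: the point is simply that each statement is a three-part ‘triple’ whose relationship is explicit.

```python
# A toy model of linked data 'triples': each statement is a
# (subject, predicate, object) tuple - the grammar of subject-verb-object.
triples = [
    ("John", "kissed", "Mary"),
    ("Martin Moore", "wrote", "this article"),
    ("The BBC", "published", "the lion page"),
]

# Because the relationship is explicit, a machine can answer precise
# questions, e.g. "what did John do, and to what?"
def statements_about(subject, triples):
    return [(verb, obj) for subj, verb, obj in triples if subj == subject]

print(statements_about("John", triples))  # [('kissed', 'Mary')]
```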
Suddenly, instead of having an indistinguishable soup of stuff on the web, you have lots and lots of distinct entities with clearly defined relationships.
Good reasons for publishing in linked data
So what? I hear you say. Why should I care about this in my day job? Well, there are a bunch of reasons why this could be a big deal. Here are just a few:
Publish in linked data and you can make your site much richer – both in terms of links and, potentially, in terms of automatically generated content. The BBC’s natural history pages are filled with interesting stuff about animals – including video clips, information about distribution, habitats, behaviours (e.g. see this one on the lion – complete with great sound clip of a lion growling and snarling). But only some of this content is produced by the BBC (mostly the video). Lots of the other information is automatically sourced from elsewhere – sites like WWF and Wikipedia. By combining it all together the BBC has pages that are far deeper and more threaded into the web.
This can have a great knock-on effect on where your page/site comes in search engine results. The BBC’s natural history pages, for example, which used to come somewhere way down the rankings, now appear in the top 10 results on Google (when I typed ‘lion’ into google.co.uk earlier this week, the BBC page came fourth, while aardvark came third).
Linked data can also help with sourcing. Now that lots of primary data sources are being published as linked data (e.g. on data.gov.uk) you can link directly back to the raw figures that you’re writing about. So if you write a piece about the rise in car thefts in south Wales, people can follow a line straight from your piece to the Home Office data on which it was based.
It can improve accreditation. By providing clear, consistent and unambiguous information about who wrote something, who published it, when it was published, where it was written and so on, the producer gets better credit, and the reader has the tools to judge its credibility.
It can make searching really smart. Let’s say you wanted to search for all the composers who worked in Vienna between 1800 and 1875. Right now that’s pretty tricky – or at least it might take a bit of digging to work out. But if the information was published in linked data format you could simply run that search, because the web itself becomes a sort of distributed database.
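As a sketch of how such a search works once relationships are made explicit, here is a toy example. The composers’ dates are rough and purely illustrative, and real linked data would use URIs and a query language such as SPARQL rather than plain tuples – but the principle is the same.

```python
# A tiny, made-up 'database' of facts of the form
# (who, relationship, where, from_year, to_year).
facts = [
    ("Beethoven", "worked_in", "Vienna", 1792, 1827),
    ("Schubert",  "worked_in", "Vienna", 1810, 1828),
    ("Brahms",    "worked_in", "Vienna", 1862, 1897),
    ("Elgar",     "worked_in", "London", 1889, 1891),
]

def composers_in(city, start, end, facts):
    """Who worked in `city` at any point between `start` and `end`?"""
    return [who for who, rel, where, a, b in facts
            if rel == "worked_in" and where == city
            and a <= end and b >= start]

print(composers_in("Vienna", 1800, 1875, facts))
# ['Beethoven', 'Schubert', 'Brahms']
```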
Finally, but perhaps most importantly, linked data can create an environment that enables innovation and the creation of new services. Suddenly it becomes possible to build really smart stuff based on the way in which things are linked together. The BBC’s World Cup site did just this in the summer of 2010, publishing huge amounts of information – more than any team of journalists could have put together themselves – sourced from lots of different places. The New York Times now publishes in linked data and encourages people to build new things on top of it. There is a tutorial for building a web app to show NYT coverage of a school’s alumni, for example – see a finished app here.
Other companies are starting to use linked data and other semantic information to build recommendation engines (like GetGlue). People can start adding value to data that you would never have thought of.
A final warning
The basic premise of linked data is wonderfully simple. You link discrete things together in such a way that we know the relationship between them (subject-verb-object). Once linked, the web then starts to have an artificial intelligence of its own.
But putting this basic premise into action is more complicated. Publishing in linked data for the first time is not for the faint-hearted (we now publish journalisted.com in linked data, and learnt for ourselves how complex it can be). You can quite quickly find yourself mired in the intricacies of linked data formats, vocabularies and acronyms.
Though there are ways to move towards linked data without plunging in head first. Just publishing structured metadata is a very good start (for which there are various plugins for open source CMSs like WordPress). Microformats are also a much easier entry point for those wanting to introduce some metadata to what they publish (e.g. hNews for news).
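As a flavour of how lightweight the microformat route can be, here is a rough sketch of a story marked up with hNews-style class names. The names and values below are illustrative only – check the hNews specification for the exact vocabulary before using it:

```html
<!-- A rough, illustrative sketch of hNews-style markup: ordinary HTML
     with class names that identify the title, author, source
     organisation and publication date. -->
<div class="hentry">
  <h1 class="entry-title">Car thefts rise in south Wales</h1>
  <p class="author vcard"><span class="fn">Jane Reporter</span></p>
  <p class="source-org vcard"><span class="org">Example News Ltd</span></p>
  <abbr class="published" title="2010-11-18">18 November 2010</abbr>
  <div class="entry-content">Article text goes here…</div>
</div>
```

The attraction is that the metadata lives inside the article markup itself, so it describes the story rather than the whole page, and a CMS plugin can add it automatically.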
Linked data is remarkable. It’s also a little scary. But the sooner people understand its potential and start making their information more ‘semantic’, the healthier and more navigable the web will be.
Far be it from me to question the brilliance of Google, but in the case of its new news meta tagging scheme, I’m struggling to work out why it is brilliant or how it will be successful.
First, we should applaud the sentiment. Most of us would agree that it is A Good Thing to be able to distinguish between syndicated and non-syndicated content, and to be able to link back to original sources. Both of these are – in theory – important steps forward, for news organizations and for the public alike.
But there are a number of problems with the meta tag scheme that Google proposes.
Problems With Google’s Approach
Meta tags are clunky and likely to be gamed. They are clunky because they cover the whole page, not just the article. As such, if the page contains more than one article or, more likely, contains lots of other content besides the article (e.g. links, promos, ads), the meta tag will not distinguish between them. More importantly, meta tags are, traditionally, what many people have used to game the web. Put in lots of meta tags about your content, the theory goes, and you will get bumped up the search engine results. Rather than address this problem, the new Google system is likely to make it worse, since the “original source” meta tag will be assumed to carry material value.
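For reference, the proposed tags are ordinary HTML meta elements sitting in the page’s head – which is exactly why they describe the whole page rather than any one article on it. A sketch, with made-up URLs (the tag names follow Google’s 2010 announcement):

```html
<!-- Page-level meta tags in the <head>: they can only make one
     claim per page, however many articles the page contains. -->
<meta name="syndication-source" content="http://example.com/the-wire-story">
<meta name="original-source" content="http://example.com/the-original-report">
```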
Though there is clear value in being able to identify sources, distinguishing an “original source” from a mere source is fraught with complications. This is something those of us working on hNews, a microformat for news, have found when talking with news organizations. For example, if a journalist attends a press conference and then writes it up, is that write-up the original source? Or is it the press release from the conference, with a transcript of what was said? Or the report written by another journalist in the room, published the following day? Google appears to suggest they could all be “original sources”, but if the term extends that far it is hard to see what use it is.
Even when there is an obvious original source, like a scientific paper, news organizations rarely link back to it (even though it’s easy using a hyperlink). The BBC – which is generally more willing to source than most – has, historically, tended to link to the front page of a scientific publication or website rather than to the scientific paper itself (something the Corporation has sought to address in its more recent editorial guidelines). It is not even clear, in the Google meta-tagging scheme, whether a scientific paper is an original source, or the news article based on it is an original source.
And what about original additions to existing news stories? As Tom Krazit wrote on CNET news,
… the notion of “original source” doesn’t take into account incremental advances in news reporting, such as when one publication advances a story originally broken by another publication with new important details. In other words, if one publication broke the news of Prince William’s engagement while another (hypothetically) later revealed exactly how he proposed, who is the “original source” for stories related to “Prince William engagement,” a hot search term on Google today?
Something else Google’s scheme does not acknowledge is that there are already methodologies out there that do much of what it proposes, and that are in widespread use (ironic given Google’s launch title “Credit where credit is due”). For example, our News Challenge-funded project, hNews, addresses the question of syndicated versus non-syndicated content, and in a much simpler and more effective way. Google’s meta tags do not clash with hNews (both conventions can be used together), but neither do they build on its elements or work in concert with them.
One of the key elements of hNews is “source-org”, the source organization from which the article came. Not only does this go part-way towards the second, “original source” tag Google suggests, it also cleverly avoids the difficult question of how to credit a news article that may be based on wire copy but has been adapted since – a frequent occurrence in journalism. The Google syndication method does not capture this important difference. hNews is also already the standard used by the U.S.’s biggest syndicator of content, the Associated Press, and by more than 500 professional U.S. news organizations.
It’s also not clear whether Google has thought about how this will fit into the workflow of journalists. Every journalist we spoke to when developing hNews said they did not want to have to do things that would add time and effort to what they already do to gather, write, edit and publish a story. It was partly for this reason that hNews was made easy to integrate into publishing systems; it’s also why hNews marks information up automatically.
Finally, the new Google tags only give certain aspects of credit. They give credit to the news agency and the original source but not to the author, or to when the piece was first published, or how it was changed and updated. As such, they are a poor cousin to methodologies like hNews and linked data/RDFa.
Ways to Improve
In theory Google’s initiative could be, as this post started by saying, a good thing. But there are a number of things Google should do if it is serious about encouraging better sourcing and wants to create a system that works and is sustainable. It should:
- Work out how to link its scheme to existing methodologies — not just hNews but linked data and other meta tagging methods.
- Start a dialogue with news organizations about sourcing information in a more consistent and helpful way.
- Clarify what it means by original source and how it will deal with different types of sources.
- Explain how it will prevent its meta tagging system from being misused such that the term “original source” fast becomes useless.
- Use its enormous power to encourage news organizations to include sources, authors, etc. by ranking properly marked-up news items over plain-text ones.
It is not clear whether the Google scheme – as currently designed – is more focused on helping Google with some of its own problems sorting news or with nurturing a broader ecology of good practice.
One cheer for intention, none yet for collaboration or execution.
This article was first posted at PBS MediaShift Ideas Lab on Thursday 18th November.
[A version of this article was first published at PBS MediaShift IdeasLab]
On a news organization’s list of priorities, publishing articles as ‘linked data’ probably comes slightly above remembering to turn the computer monitors off in the evening and slightly below getting a new coffee machine.
It shouldn’t, and I’ll list 10 reasons why.
Before I do I should briefly explain what I mean by ‘linked data’. Linked data is a way of publishing information so that it can easily – and automatically – be linked to other, similar data on the web. For example, if I refer to ‘Paris’ in a news article it’s not immediately apparent to search engines whether that is Paris, France; Paris, Texas; or Paris Hilton (or indeed another Paris entirely). If published as linked data, ‘Paris’ would be linked to a reference point that makes clear which one it refers to (e.g. the entry for Paris, France on dbpedia – the structured data version of Wikipedia).
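A toy sketch of the idea, for the technically curious: the mention carries an identifier as well as a text label, so a machine compares identifiers rather than ambiguous strings. (The dbpedia URIs are real; the article reference is made up for illustration.)

```python
# A mention of 'Paris' in an article, disambiguated by identifier.
mention = {
    "article": "example-news-story-123",     # made-up article id
    "text": "Paris",
    "identifier": "http://dbpedia.org/resource/Paris",  # Paris, France
}

# A second mention that *looks* identical as text, but is a different Paris.
other = {
    "text": "Paris",
    "identifier": "http://dbpedia.org/resource/Paris,_Texas",
}

# Comparing strings would wrongly match; comparing identifiers does not.
print(mention["text"] == other["text"])              # True
print(mention["identifier"] == other["identifier"])  # False
```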
Until a short while ago I was reasonably clueless as to both the meaning and the value of linked data. I’m still far from being an expert, but enough people who are far smarter than me have convinced me that it’s worth trying. This was especially the case a couple of months back, at a News Linked Data Summit that we (the Media Standards Trust) organized with the BBC (which you can read about on a previous blog).
So, 10 reasons why news organizations should bump linked data up their priority list:
1. Linked data can boost SEO (search engine optimization)
People who tell you they can boost your SEO usually sound like witch doctors, telling you to tag all sorts of hocus pocus that doesn’t make rational sense or just seems like cynical populism. But at its simplest, SEO works through links. The more something is linked to, the higher it will come in people’s search results. So publishing content as linked data should, quite naturally, increase its SEO. A great example of this is the BBC’s natural history output. Type ‘Lion’ into Google and, chances are, a BBC linked data page will come in the first 10 results. This never used to happen until the BBC started tagging their natural history content as linked data.
2. Linked data allows others to link to your site much more easily
The world wide web is, more and more, being powered by algorithms; the Google search algorithm is perhaps the most obvious. But most sites now take advantage of some mechanized intelligence. ‘If you liked reading this, you might enjoy this…’ sort of thing. Problem is, algorithms – though intelligent – aren’t that intelligent. They have trouble telling the difference between, for example, Martin Moore (me), Martin Moore (kitchens), and Daniel Martin Moore (the Kentucky singer songwriter). But use linked data and they can. And once they can, sites like the BBC can link externally much more easily and intelligently.
3. Helps you build services based on your content
As it becomes increasingly difficult to get people to pay for news, so news organizations will need to build services based on their news – and other content – that people will pay for. You could, for example, provide a service that enabled people to compare schools in different areas, based on inspection reports, league tables, news reports, and parents’ stories. Creating services to do this is lots and lots easier if content is already made machine-readable through linked data.
4. Enables other people to build services based on your content – that you could profit from
Other people often have ideas you haven’t thought of. Other people also often have the space and time to experiment that you don’t have. Give them the opportunity to build things through linked data and they might come up with the ‘killer apps’ that make you money. iPhone apps, anyone?
5. Allows you to link direct to source
You’re a news organization. Your brand is based partly on how much people trust the stuff you publish. Publish in linked data and you can link directly back to the report, research or statistics on which a story was based – especially if that source is itself published as linked data (like http://data.gov.uk). That way, if you cite a crime statistic, say, you can link it straight back to the original source.
6. Helps journalists with their work
As a news organisation publishes more of its news content in linked data, so it can start providing its journalists with more helpful information to inform the articles they’re writing, and to make suggestions as to what else to link to when it’s published.
7. Throws bait over the paywall
Once content is behind a paywall it becomes invisible – unless you pay (that’s sort of the point). This is as true for Joe Public as it is for a search engine. But how are you, Joe Public, supposed to work out whether you want to pay for something if it’s invisible? Publish in linked data and there will be enough visible bits of information to help people decide whether they want to pay. [This will probably matter less with big search engines like Google, and more with other search engines and third-party services. Mind you, one of those bit players will, most likely, be the next Google or Facebook.]
8. Makes data associated with your content dynamic
There is an ever growing mountain of information on the net that never gets updated. Pages devoted to football teams whose last score was added in 2006. Topic pages about political issues that haven’t seen a new story in months. But if those pages were filled with linked data, and linked to others that were too, they’d be automatically updated – rising from the dead like Frankenstein without you having to do diddly squat.
9. Start defining news events in linked data now and you could become a ‘canonical reference point’ (CRP)
What the heck is a canonical reference point, I hear you ask. Well, it’s a little like a virtual Grand Central Station. It’s a junction point for linked data; a hub which hundreds or even thousands of other sites link to as a way of helping to define their references. Examples of such hubs include: http://musicbrainz.org for music and musicians, data.gov.uk for UK gov stuff, http://dbpedia.org for almost anything. If you’re a news organization, why would you not want to be a hub?
10. Raises the platform for all
A web of linked data is a more intelligent web. A more mature and less superficial web. Not quite a semantic web, but getting there.
Of course, some of these benefits will come disproportionately to first movers (as with the BBC’s natural history pages). Which is exactly why news organizations, who have previously been pretty slow when it comes to web innovation, need to get their skates on.
More on linked data:
‘Linked data is blooming – why you should care’ on the ever readable Read Write Web, May 2009 (325 retweets to date)
A graphic of the linked data web: http://linkeddat
Tim Berners-Lee talking about linked data at TED 2009
My blog about our linked data summit
On Friday we co-hosted a news linked data summit, along with the BBC (and with some help from the Guardian).
The purpose of the day was to talk about linked data – what a linked data future might look like, what role linked data has for news organizations, and what news organizations should do about it. I’ll note down what I can remember from it in this blog, though given that I was probably the least technical person there, any tech references come with a big caveat (and I’d welcome being corrected on them).
The day was particularly opportune given that on Thursday Sir Tim Berners-Lee and Professor Nigel Shadbolt had launched data.gov.uk – a new site that provides a route into ‘a wealth of government data’.
Nigel Shadbolt was also at the news linked data summit, giving his vision of what a linked data future might look like – including examples of a ‘post code newspaper’, a mash-up of cycle route blackspots, and a clever illustration of how our income tax gets spent.
Martin Belam, of the Guardian and currybet.net, talked about the value of linked data to news organizations (which you can read on the Guardian blog here), and Richard Wallis, of Talis, gave an overview of where news organizations are now in terms of linked data and metadata standards (see Richard’s presentation here).
Those at the day included us (the Media Standards Trust), and people from the BBC, the Guardian, the Times, News International, the Telegraph, the Associated Press, Thomson Reuters, the Press Association, the New York Times, the FT, the Mail, the Newspaper Licensing Association (NLA), and the Association of Online Publishers (AOP).
The upshot was: everyone agreed that linked data could, potentially, be pretty exciting. It could enable much better and broader linking, it could help people discover the provenance of data, it could enable news to evolve much more dynamically than it does now, it could even do good things for SEO (though that’s a master art I won’t even try to figure out).
There was general agreement that the “One Ring To Rule Them All” approach doesn’t generally work on the web. In other words, you’ll never get 100% agreement between organisations on which things are actually events or concepts, so the best you can do is to try to provide some mapping where sensible.
Therefore there would, inevitably, be multiple vocabularies and multiple places to link, although one could imagine some sources becoming ‘canonical’, i.e. the default reference for most linked data. A good example would be the names of UK schools: one could imagine there being a list of these on the department of education website which would act as a sort of central repository.
There was also agreement that it would be a good thing if people started dipping their toe in the water. No-one is going to know how valuable – or not – linked data is without giving it a try.
For some of the news organizations the forthcoming general election seemed like a good place to start. There could be a lot of public value in linking, for example, parliamentary candidates.
If you want to know more about the day, or keep in touch with the progress of linked data and news, you can contact me at martin DOT moore AT mediastandardstrust DOT org.