Nov 21 2007
Is the Semantic Web Destined to be a Shadow?
Current practice in the Semantic Web community is leading to the creation of a shadow web that is becoming disconnected from the web of documents. This fracturing is being caused by the W3C’s decision to restrict the types of resources that can be addressed directly with HTTP.
Rob McCool points out in his Rethinking the Semantic Web article that much of the W3C’s Semantic Web activity goes to promoting the creation of separate RDF documents, creating a “shadow web” largely invisible and inaccessible to the bulk of users. Because few humans traverse and explore this shadow web, and because the documents require significant technical understanding of the RDF model, there is no significant ability to validate or affirm the relevance of the metadata being expressed. In some respects the nascent web of data is experiencing a golden age where all data is created with the best intentions. However, my prediction is that were this web of data to become visible in a major search engine, it would become another vector for spam to attack search results. Today’s spammers are far more evolved after a long arms race with the Web search engines, and the semantic web community is vanishingly small compared to that of the wider Web. It is an open question as to whether this shadow web could ever survive this hostile environment. In the W3C’s classic layer cake diagram of the Semantic Web, the topmost layer is “trust”, and its positioning indicates that it will be the last component to be built, once all the mechanics are in place.
My belief is that trust must be considered far earlier and that it largely comes from usage and the wisdom of the crowds, not from technology. Trust is a social problem and the best solution is one that involves people making informed judgements on the metadata they encounter. To make an effective evaluation they need to have the ability to view and explore metadata with as few barriers as possible. In practice this means that the web of data needs to be as accessible and visible as the web of documents is today and it needs to interweave transparently. A separate, dry, web of data is unlikely to attract meaningful attention, whereas one that is a full part of the visible and interactive web that the majority of the population enjoys is far more likely to undergo scrutiny and analysis. This means that HTML and RDF need to be much more connected than many people expect. In fact I think that the two should never be separate. It’s not enough that you can publish RDF documents; you need to publish visible, browseable and engaging RDF that is meaningful to people. Tabular views are a weak substitute for a rich, readable description.
Keeping metadata visible and auditable by humans is one of the key principles of the microformats movement. Tantek described the process as one where:
Authors readily saw mistakes themselves and corrected them (because presentation matters). Readers informed authors of errors the authors missed, which were again corrected. This feedback led to an implied social pressure to be more accurate with hyperlinks thus encouraging authors to more often get it right the first time. When authors/sites abused visible hyperlinks, it was obvious to readers, who then took their precious attention somewhere else. Visible data like hyperlinks with the positive feedback loop of user/market forces encouraged accuracy and accountability. This was a stark contrast from the invisible metadata of meta keywords, which, lacking such a positive feedback loop, through the combination of gaming incentives and natural entropy, deteriorated into useless noise.
This is akin to the many eyes principle of the open source movement. Making metadata a visible and integral part of the web page was the principal motivation that led me to develop embedded RDF and is an important consideration in the design of RDFa. The importance of the existing web to the nascent Semantic Web is also underlined by the W3C’s recent standardization of GRDDL, which allows pre-existing documents to be transformed into RDF.
However, there is a problem with this coexistence and it’s forced by the W3C TAG’s notions of Information Resources and the httpRange-14 decision on the types of resources that can be addressed with HTTP. As I pointed out in my recent Fragmentation post, there is strong pressure towards using URIs with fragment IDs to represent “non-information resources”.
The dogma that URIs without fragment identifiers must be restricted to document-like resources pushes people into using URIs like http://www.w3.org/People/Berners-Lee/card#i to denote things that aren’t documents. However, there’s a big problem with this: in reality the fragment identifier is associated with representations of resources, not the resources themselves. The fragment identifies a portion of a representation obtained from a URI, and its meaning changes depending on the type of representation.
In the Web Architecture it is impossible to get a representation of http://www.w3.org/People/Berners-Lee/card#i so you have to get a representation of http://www.w3.org/People/Berners-Lee/card and hope it contains information about the resource you’re looking for. However, the Web Architecture also forbids you from serving up both HTML and RDF documents at that URI that refer to the same fragment id. You can have a machine readable RDF version or a human readable HTML version but not both at the same time. Ever. Unless you really did mean to refer to an HTML document fragment. If you’re a mere mortal reader, rather than an RDF guru, then you can’t find out what http://www.w3.org/People/Berners-Lee/card#i denotes because of that single hash character!
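To make the role of that hash character concrete, here is a minimal Python sketch using only the standard library. The helper name request_target and the FRAGMENT_MEANING table are hypothetical, introduced purely for illustration; urldefrag is the real standard library function. It shows that the fragment is split off on the client side, so only the hashless URI is ever requested, and that what the fragment means then depends on the media type that comes back.

```python
from urllib.parse import urldefrag


def request_target(uri: str) -> tuple[str, str]:
    """Split a URI into the part a client requests and the fragment it keeps.

    The fragment is never sent to the server; the client applies it to
    whatever representation is returned.
    """
    url, fragment = urldefrag(uri)
    return url, fragment


# Illustrative summary only (an assumption, not taken from any specification
# text): how the same fragment is read depends on the representation's type.
FRAGMENT_MEANING = {
    "text/html": "a part of the HTML document (the element with that id)",
    "application/rdf+xml": "a thing described by the RDF, not necessarily part of the document",
}

url, fragment = request_target("http://www.w3.org/People/Berners-Lee/card#i")
print(url)       # http://www.w3.org/People/Berners-Lee/card
print(fragment)  # i
for media_type, meaning in FRAGMENT_MEANING.items():
    print(f"#{fragment} in {media_type}: {meaning}")
```

If both representations are served from the same hashless URI, the one fragment ends up carrying two different meanings, which is the conflict described above.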
The inevitable consequence of this dogma is the statement I opened with: current practice is leading to the creation of a shadow web that is becoming disconnected from the web of documents. Pushing the web of data further away from people is very dangerous with far reaching consequences for the success (or not) of the Semantic Web, especially when the spammers get involved.
17 Responses to “Is the Semantic Web Destined to be a Shadow?”

I very much agree with your side point that we have to consider trust now, not later. One of the main reasons why the meta element has fallen by the wayside was abuse and many have predicted that the same will happen to the semweb. I have countered with the point that it is on the radar and relatively easily addressed since the social fabric has a pretty nice graph shape…
But I’m not sure we devote enough attention to it, and we have a few bad publicity disasters coming if the first high profile use of semweb technologies is spamming.
So, without commenting on your main points, it is certainly important to build the trust component early.
Just to make sure I understand you correctly: are the current practices in the Linking Open Data project, such as dbpedia.org, geonames.org, or lingvoj.org to name a few, disconnecting the Web of data from the mainstream web? Having for each resource a dedicated HTML page linked to and from its RDF match seems to ensure the connection. For example, geonames data are already used by mainstream web applications through pre-semantic-web services, and the RDF data are just another layer put on top of those.
Does your analysis apply to URIs such as http://dbpedia.org/resource/Victor_Hugo, http://sws.geonames.org/3033123/, or http://www.lingvoj.org/lang/fr?
“You can have a machine readable RDF version or a human readable HTML version but not both at the same time. Ever.”
If you have #chapter3 in the HTML variant, and the RDF variant states that #chapter3 is a section of a document, then both variants are consistent and there’s no problem. So much for “You can’t have RDF and HTML versions at the same URI, ever.”
You also failed to point out that
a) the problem is non-existent if you do like almost everyone and don’t use content negotiation,
b) there’s a readily available and 100% webarch-compatible workaround (303ing from card to card.html if the client requests HTML), and
c) folks in the TAG are aware of the issue and are exploring options to address it.
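The 303 workaround in point b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the respond function is hypothetical, the Accept check is deliberately naive (it ignores q-values), and serving RDF/XML directly at the hashless URI is an assumed choice, not a description of how w3.org is actually configured.

```python
def respond(path: str, accept: str) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for a request, 303ing HTML clients to card.html.

    Hypothetical handler: the hashless URI serves RDF by default and
    redirects clients that ask for HTML to a separate HTML document.
    """
    card = "/People/Berners-Lee/card"
    wants_html = "text/html" in accept  # naive check, ignores q-values

    if path == card:
        if wants_html:
            # 303 See Other: the HTML lives at a different URI
            return 303, {"Location": card + ".html"}
        return 200, {"Content-Type": "application/rdf+xml"}
    if path == card + ".html":
        return 200, {"Content-Type": "text/html"}
    return 404, {}


print(respond("/People/Berners-Lee/card", "text/html"))
print(respond("/People/Berners-Lee/card", "application/rdf+xml"))
```

Because the HTML ends up at its own URI, fragments in the HTML document and fragments in the RDF no longer share a base URI, which is why the workaround avoids the clash.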
About that alleged “shadow web”. Comment spam and linkfarms and splogs run rampant even though formatted in readily accessible and easily visible HTML. HTML doesn’t guarantee eyeballs, and eyeballs don’t protect from spam attacks. Furthermore, without appropriate tools in users’ hands, embedded RDF is just as much in the shadows as RDF linked in a separate document. Finally, Atom and RSS are part of the shadow web and still work remarkably well.
Bernard: of course the HTML representations of those links you provided are linked to their RDF counterpart. I think the problem here is that from what you see you can’t reach and re-use the data. If e.g. the site of a TV channel were to publish its programme on their site and also publish it in RDF linked via rel=”alternate” from that, then it would still not be possible to click on a broadcast date on the page and get the data behind it imported into your calendar. With embedded semantic data it is.
Moving out of the shadow with RDFa…
… Ian Davis has written an interesting series of posts related to the problems arising from using fragment identifiers in resource URIs [...] I think that conclusion is not correct……
[...] writes in response to my post on shadow webs: eRDF re-uses HTML’s @id to establish resource identifiers, so it mixes document identifiers [...]
Richard: yes of course in that case clearly the fragment is unambiguous between HTML and RDF. But not if I served SVG from that URI. I’ll update my post though.
As to your other points:
a) content negotiation isn’t the problem. People avoid it partly because of the ambiguity of fragment identifiers. If we weren’t trying to press them into service as identifiers for abstract resources then conneg would work pretty much as specified (aside from proxy issues). Being able to negotiate the format you want is pretty fundamental to REST.
b) the point of this series of posts is that we’re complicating the web architecture by compensating for the broken nature of fragment identifiers. We can keep applying patches or we can simplify it. The 303 decision is a poor patch.
c) I don’t think the fact that the TAG are looking into it means we shouldn’t debate it. That kind of appeal to authority doesn’t sit well with me.
[...] my shadow web post hasn’t convinced you then try this thought [...]
[...] reminded of this by Ian Davis’ post Is the Semantic Web Destined to be a Shadow?: My belief is that trust must be considered far earlier and that it largely comes from usage and [...]
[...] today! All I wanted to write about right now is the “shadow” metaphor. It crops up in Ian’s posts, and he cites Rob McCool’s writings. Since I’m unequiped with an IEEE login, I’m [...]
[...] Davis has an interesting post about the Shadow Web that may be created if we don’t bridge the existing (”clickable”, as I like to [...]
Ian: Thanks for the correction.
It’s not hard to serve HTML and SVG variants from the same URI and use the same fragments in a consistent way. You’ll have a hard time demonstrating any real-world breakage in that case. Fragments in almost all formats identify “parts of the document”, so they identify fundamentally similar things. The breakage only occurs with RDF, because with RDF, fragments can identify things that are fundamentally not parts of the document.
And of course content negotiation is the problem. If there was no content negotiation, the entire problem would go away. Hashless URIs would be documents, hash URIs would be things inside the document (described in the document, or parts of the document, depending on the format). It couldn’t be simpler. Personally, I’d rather dump content negotiation from the web architecture than fragment identifiers. Content negotiation is nice in theory but doesn’t work well in practice. Fragment identifiers are messy in theory, but work well in practice.
b) Yes, 303 is a poor patch. c) Fair enough.
I believe universal learning systems represent a way to bootstrap the semantic web; for example see http://complearn.org/ for one approach.
Who is afraid of the GGG?…
The GGG, or Giant Global Graph. It sounds like something with which you might terrify a child at bed time, but this is no Gruffalo, no Jabberwock, no Smaug. Rather it’s father-of-the-web Tim Berners-Lee’s label for his latest attempt……
[...] Not demonstrations of concept in the laboratory and behind the firewall. Not the creation of a shadow web. This. So thank you, Tim, for reminding us. That said, might Nova’s ‘semantic [...]
[...] been doing some more digging on my fragmentation and shadow web themes and came across something I hadn’t really seen before or, if I have, has been [...]