Archive for November, 2007

Nov 30 2007

links for 2007-11-30

Published by Ian Davis under Uncategorized


Nov 29 2007

It’s OK to use URIs with Fragments in RDF

Published by Ian Davis under Uncategorized

I’ve been doing some more digging on my fragmentation and shadow web themes and came across something I hadn’t really seen before or, if I have, has been completely wiped from my mind. The RDF Concepts document contains a whole section on fragment identifiers which is worth reproducing:

RDF uses an RDF URI Reference, which may include a fragment identifier, as a context free identifier for a resource. RFC 2396 [URI] states that the meaning of a fragment identifier depends on the MIME content-type of a document, i.e. is context dependent.

These apparently conflicting views are reconciled by considering that a URI reference in an RDF graph is treated with respect to the MIME type application/rdf+xml [RDF-MIME-TYPE]. Given an RDF URI reference consisting of an absolute URI and a fragment identifier, the fragment identifier identifies the same thing that it does in an application/rdf+xml representation of the resource identified by the absolute URI component. Thus:

  • we assume that the URI part (i.e. excluding fragment identifier) identifies a resource, which is presumed to have an RDF representation. So when eg:someurl#frag is used in an RDF document, eg:someurl is taken to designate some RDF document (even when no such document can be retrieved).
  • eg:someurl#frag means the thing that is indicated, according to the rules of the application/rdf+xml MIME content-type as a “fragment” or “view” of the RDF document at eg:someurl. If the document does not exist, or cannot be retrieved, or is available only in formats other than application/rdf+xml, then exactly what that view may be is somewhat undetermined, but that does not prevent use of RDF to say things about it.
  • the RDF treatment of a fragment identifier allows it to indicate a thing that is entirely external to the document, or even to the “shared information space” known as the Web. That is, it can be a more general idea, like some particular car or a mythical Unicorn.
  • in this way, an application/rdf+xml document acts as an intermediary between some Web retrievable documents (itself, at least, also any other Web retrievable URIs that it may use, possibly including schema URIs and references to other RDF documents), and some set of possibly abstract or non-Web entities that the RDF may describe.

This provides a handling of URI references and their denotation that is consistent with the RDF model theory and usage, and also with conventional Web behavior. Note that nothing here requires that an RDF application be able to retrieve any representation of resources identified by the URIs in an RDF graph.

I’ve been thinking about this for a couple of days and I’m still not entirely sure what to make of it. What it appears to be saying is that RDF ignores the Web Architecture principle that fragment identifiers are given meaning by the representation that is retrieved.

So this ensures that RDF is self-consistent. I can refer to anything I like using a fragment identifier in my URI and I’m guaranteed not to have my intended meaning upset by anything messy like a network operation. This alleviates one of my major concerns about using these kinds of URIs in RDF, but at what cost? If anything, this increases my concerns over the shadow web, since by circumventing the web architecture it sets RDF further away from today’s web of documents. For example, when I use “http://www.w3.org/TR/webarch/#media-type-fragid” as a URI in my RDF, it probably doesn’t refer to the thing you think it does. You, as a human (if you are), get to see a representation of that section of the document when you click on the link, but an RDF-aware agent must treat that URI as though RDF/XML had been retrieved. Unfortunately there isn’t any RDF there and the Web Architecture actually forbids you from serving up both HTML and RDF documents at the same URI.

What does that mean? How are we supposed to interpret that? One interpretation is that it really doesn’t matter what you do outside of RDF. You can throw up all kinds of other representation formats and it won’t affect yours or anyone else’s RDF. They might use the same identifiers, and occasionally, coincidentally they may identify the same things, but in general RDF is partitioned into its own little world. RDF can only link to RDF.

How can RDF co-exist with other formats on the Web if it ignores their semantics? If you just want the Semantic Web to be built using RDF then you probably don’t care. But if, like me, you want to see an inclusive Semantic Web built from a mix of RDF, microformats, topic maps, RDDL and all the other ways to express semantics, then it’s a very very big problem. I don’t want two webs competing for attention, I want one strong one.

Hence the title of this post. It is OK to use URIs with fragments in RDF, but only if you don’t particularly care about relating to the existing web. If you do care then avoid fragments at all costs. Use standard URIs and stick 303 redirects on them if you need to. It’ll work and the whole web will be better for it.
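
A minimal sketch of what "stick 303 redirects on them" could look like on the server side. The `/id/` and `/doc/` path layout and the `respond` function are assumptions made up for illustration; nothing in the post prescribes them. The idea is only that a hashless URI denoting a thing answers with 303 and a pointer to a document about it.

```python
# Hypothetical URI layout (an assumption, not from the post):
#   /id/<name>   denotes the thing itself (a city, a person, a unicorn)
#   /doc/<name>  is a document describing that thing

def respond(path):
    """Return (status, headers) for a GET on the given path."""
    if path.startswith("/id/"):
        # The thing itself cannot be sent over HTTP, so point at its description.
        name = path[len("/id/"):]
        return 303, {"Location": "/doc/" + name}
    if path.startswith("/doc/"):
        # An ordinary document, served directly.
        return 200, {"Content-Type": "text/html"}
    return 404, {}


print(respond("/id/london"))
print(respond("/doc/london"))
```

No fragment is involved at any point, so the URI used in the RDF is the same one a browser can follow.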

13 responses so far

Nov 29 2007

links for 2007-11-29

Published by Ian Davis under Random Stuff


Nov 26 2007

Platform at SWIG-UK, Bristol

Published by Ian Davis under Uncategorized

Last Friday I gave a presentation on the Talis Platform to the SWIG meeting, kindly hosted by HP Labs Bristol. I’ve posted the slides up at our n2 developer community site. They’re not much to look at but I wrote them to be informative rather than suggestive. Although I love seeing beautifully spare presentations being given, they frustrate me when I want to go back and see what the speaker said and find a picture of a butterfly on a flower :)

Nad’s written a good summary over on his blog and managed to capture the questions I was asked at the end too, which is nice to see after the event. The whole day was brilliant with lots of chances to natter and catch up with everyone. I met lots of new people too and everyone seemed to be doing something interesting. I even managed to harangue Stuart Williams off of the W3C TAG on my recent Web architecture posts. Andy said about 50 people were attending despite there being no marketing of the event at all, which is a good indicator of the rising popularity of the Semantic Web. Talis are planning to host a SWIG meeting like this in the middle of next year – hopefully we can get more people from the Midlands interested.

There were plenty of other cool presentations on the day too. Nad and Rob were blogging but couldn’t post them live for some reason so check out their sites and Nodalities too in the coming days. I particularly enjoyed Leigh’s talk on Facet, another templating framework for RDF, this time in bog-standard Java allowing the use of JSP and/or Velocity; Richard’s talk on Sindice (which I didn’t get to see at ISWC and for us ignorant Brits is pronounced “sin-dee-chee”); and Graham’s talk on image publication. All great stuff!


Nov 26 2007

links for 2007-11-26

Published by Ian Davis under Uncategorized


Nov 25 2007

links for 2007-11-25

Published by Ian Davis under Uncategorized

One response so far

Nov 21 2007

Isn’t The Web Built From Links?

Published by Ian Davis under Uncategorized

If my shadow web post hasn’t convinced you then try this thought experiment:

You want to link from your webpage to Tim Berners-Lee’s URI <http://www.w3.org/People/Berners-Lee/card#i>, except you can’t because that link points to something that can never contain the #i fragment in its HTML. It can only ever link to RDF because Tim is relying on RDF’s semantics for the meaning of the #i fragment. Tough luck if you can’t read RDF or don’t want to have to learn.

6 responses so far

Nov 21 2007

Things Can Be Simpler

Published by Ian Davis under Random Stuff

Benjee writes in response to my post on shadow webs:

eRDF re-uses HTML’s @id to establish resource identifiers, so it mixes document identifiers with non-doc ones, and this is an ambiguity problem indeed. RDFa, however, is a layer on top of HTML that introduces a dedicated mechanism for resource identification, the @about attribute (and that’s why it unfortunately needs an own DTD, but that’s another story). From a WebArch POV, the design is clean, content-type-specific identifiers don’t get mixed. I can unambiguously describe what “..ben#self” is meant to identify without the representation format playing a role. RDFa can re-purpose HTML’s text nodes for RDF literals, and anchors for resource URIs, but apart from that, the HTML document is not much more than a (human-friendly) container.

I agree with his point that RDFa has a much better design than eRDF for this. But the purpose of my series of posts is to point out that this is ever so much more complicated than it needs to be. URIs with fragment identifiers are inherently ambiguous because their meaning can change due to a network operation, not simply due to the semantic declarations in RDF or RDFa documents. Performing a GET on a URI can cause my graph to become inconsistent if I was banking on it denoting a train but I got an HTML document telling me it denotes a fragment of a document.

Also, there’s a difference between getting a description of something and getting its representation. Danny touched on this in his blog recently. If I use a URI with a fragment to denote a resource then there’s no predictable way in the web architecture to get a representation of that resource. I have to get the unhashed version and hope for the best. This implies the issue isn’t just about RDF descriptions, but about using HTTP URIs to denote things other than documents in a way that is integral to the current Web.
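
To make the "get the unhashed version and hope for the best" point concrete, here is a small sketch using Python's standard `urllib.parse.urldefrag`. The URI `http://example.org/trains#express` is a hypothetical stand-in for a hash URI meant to denote a train; it is not taken from anyone's real data.

```python
from urllib.parse import urldefrag

# Hypothetical hash URI that an RDF author intends to denote a train.
uri = "http://example.org/trains#express"

# A client splits the URI before making the request: only the first part
# goes to the server, the fragment stays on the client side.
requested, fragment = urldefrag(uri)

print(requested)  # the only thing the server ever sees
print(fragment)   # interpreted afterwards, against whatever representation came back
```

The server has no way of knowing which fragment was wanted, so it cannot tailor its response to the thing the full URI denotes.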


Nov 21 2007

Reformulating the Web Architecture

Published by Ian Davis under Uncategorized

So, accepting that URIs with fragments are generally a broken piece of architecture for the Semantic Web and that information resources are not adding any real substance, here’s how I see the Web Architecture being reformulated for use with the Semantic Web:

  1. A hashless URI should be allowed to denote any resource whatsoever. Documents, books, people, galaxies and unicorns. There is no ambiguity here, the URI denotes a single thing. More than one URI can denote the same thing, so I can have a URI that denotes the city of London, and Danny can have a different URI that also denotes London.
  2. A representation of a resource can be obtained by issuing an HTTP GET on a URI. The representation is a sequence of bits that somehow stands in for the resource the URI denotes. Content negotiation can be used to select an appropriate format for the representation, without changing the actual resource being denoted. Perhaps my URI denoting London can respond with an HTML document containing essential facts and figures about the city, a JPEG aerial photograph, an SVG streetmap or a sound recording of the sounds encountered while in the city itself. None of these things are London, but they all can stand in for it in some limited fashion. I could retrieve them all to obtain a better sense of London itself, but I cannot actually obtain London using HTTP.
  3. URIs containing hashes are constrained in what they may denote and have an inherent ambiguity due to their reliance on the particular representation obtained. Their denotations vary depending on the URI plus a set of HTTP headers used during the request.
  4. There is no such thing as an “Information Resource”. All resources are made equal. However for many resources, the only representation available happens to be identical to the resource itself. Still, you cannot obtain the actual resource using HTTP, but you can get a copy in the form of a representation. The majority of HTML documents on the web behave in this manner, a single representation that is a copy of the resource itself.
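
Point 2 in the list above can be sketched as a toy content negotiator. Everything named here is an assumption for illustration: the file names, the `audio/mpeg` type for the sound recording, and the `negotiate` function itself. It is also deliberately simplified, since it honours q-values and wildcards but not the full precedence rules of real HTTP content negotiation.

```python
# Hypothetical representations available for one URI that denotes London.
# None of them is London; each merely stands in for it.
REPRESENTATIONS = {
    "text/html": "facts-and-figures.html",
    "image/jpeg": "aerial-photo.jpg",
    "image/svg+xml": "streetmap.svg",
    "audio/mpeg": "street-sounds.mp3",
}


def parse_accept(header):
    """Return (media_range, q) pairs from an Accept header, highest q first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0]
        if not media_range:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_range, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def matches(media_range, media_type):
    """True if a media range such as image/* or */* covers the media type."""
    if media_range == "*/*":
        return True
    if media_range.endswith("/*"):
        return media_type.startswith(media_range[:-1])
    return media_range == media_type


def negotiate(accept_header):
    """Pick a representation for the same resource, or None if nothing fits."""
    for media_range, q in parse_accept(accept_header):
        if q <= 0:
            continue
        for media_type, body in REPRESENTATIONS.items():
            if matches(media_range, media_type):
                return media_type, body
    return None


print(negotiate("image/svg+xml"))
print(negotiate("application/rdf+xml;q=0.9, text/html;q=0.5"))
```

Whatever the Accept header says, the URI and the thing it denotes stay the same; only the stand-in that comes back changes.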

These aren’t huge changes and they’re backwards compatible with the existing web. On the other hand they greatly reduce the reliance on fragment identifiers and they encourage people to use real unambiguous URIs to refer to things other than documents, weaving the Semantic Web right into today’s Web.

For background, you might like to read my earlier posts on this subject:

9 responses so far

Nov 21 2007

What are Information Resources Good For?

Published by Ian Davis under Uncategorized

As is probably obvious from my recent posts (e.g. Fragmentation and Is the Semantic Web Destined to be a Shadow?), I’m thinking about the TAG’s httpRange-14 decision again and a large amount of the Architecture of the World Wide Web. The more I think about it, the more I come to believe that Xiaoshu Wang’s formulation is the only one that makes any kind of architectural sense. The foundation of all these issues rests in what was once known as the URI crisis and was boiled down to the question: what kind of things can an HTTP URI identify? The TAG took this up as httpRange-14 which was resolved back in 2005 by introducing the notion of a special class of resources called “information resources”. In his essay What do HTTP URIs Identify? Tim Berners-Lee wrote:

The authors of document <http://www.w3.org/2000/10/rdf-tests/rdfcore/Manifest.rdf> certainly thought that they could use “http://www.w3.org/2000/10/rdf-tests/TestSchema/NegativeParserTest” to identify an abstract thing which is a type of software test. Now they have a choice as to what to make the server return for them when I ask for it. It returns 404 “doesn’t match anything we have available”. It can’t really, because HTTP doesn’t allow one to return a class, only a document. And if it were to return a document, then I wouldn’t be able to refer to that document without accidentally referring to the class of negative parser tests.

It seems to me that this essay contains a simple mistake which colours the whole httpRange-14 debate and resolution. It says that the URI can’t return anything because “HTTP doesn’t allow one to return a class, only a document”. That’s true, but it does allow you to return a representation of the class which is a document (or potentially an image, or spoken word audio file). HTTP can never return a resource, just representations of them, i.e. things that stand in for the resource. The final point in that quote suffers from that confusion: of course you can’t use the URI of the resource to refer to its representation. Possibly you could mint a new URI to denote it, but there is no standard vocabulary that I’m aware of that can relate a representation to its resource parameterized by the HTTP request headers and the time. There probably needs to be.

The resolution to all this was to introduce a class of resources that are basically the same as their representations: information resources. The only way to detect if you have an information resource is to GET its URI. If it responds with a 2xx response then the resource identified by that URI is an information resource. Any other response code means it might or might not be an information resource. This rule has a corollary: if you have a non-information resource then you must not respond with a 2xx response, instead you should use 303 to point to an information resource that somehow gives information about the non-information resource.
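
The rule as described in the paragraph above fits in a few lines. This is a sketch of my reading of it, not of any official algorithm; the function name and the returned strings are made up for illustration.

```python
def interpret_status(status):
    """What a GET response code tells a client under the httpRange-14 rule
    as summarised above."""
    if 200 <= status < 300:
        # A 2xx response is the only positive signal the rule provides.
        return "information resource"
    if status == 303:
        # The Location header points at a document about the resource,
        # but says nothing certain about the resource itself.
        return "unknown (follow Location for a description)"
    # Any other code: might or might not be an information resource.
    return "unknown"


for code in (200, 303, 404):
    print(code, interpret_status(code))
```

Only one branch ever yields a definite answer, and a client has to make a network request to reach it.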

Information resources are defined in AWWW:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.”

This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transferred in a message. In the case of this document, the message payload is the representation of this document.

It’s hard to understand what benefit the introduction of information resources has to the Web architecture. It definitely has drawbacks. For a start it forces all non-information resources off of the web entirely – they’re not allowed to respond with 200 status codes to GET requests. It encourages non-information resources to use URIs containing fragment identifiers which, as I’ve pointed out, are a very broken piece of architecture and are leading to the formation of a sort of shadow web.

They are also notoriously hard to define. Consider the following resources, asking of each one whether all its essential characteristics can be conveyed in a message:

  • A cat – no
  • A description of a cat – yes
  • A digital photo of a cat – yes
  • A 35mm film frame containing the image of a cat – no
  • A web page about cats – I hope so!
  • A website about cats – maybe, I guess you could tar it up and serve it from http://www.allaboutcats.com
  • The DNA of a cat – probably no
  • A recording of a cat’s mew – yes, unless it’s an analogue recording in which case we have a digital approximation of the analogue recording
  • A cat’s mew – no
  • A book about cats – probably not, the book is an abstract work of which there can be multiple editions, revisions, translations, abridgments etc.
  • The class of all cats – I say yes, timbl says no. I can convey the precise definition of a class in a message as an RDF or OWL schema so that seems to satisfy the criteria
  • The members of the class of all cats – no
  • A database of cats – yes
  • A card catalogue of cats – no
  • The name of a cat – yes
  • A taxonomy of cat species – yes
  • The cat character from Shrek 2 and 3 – yes

Remind me again, what’s the point of having this distinction in types of resources….?

35 responses so far
