HTTP URIs are not Without Expense

Introduction & Background

The content of this Wiki stems from a thread on the W3C Semantic Web Health Care and Life Sciences mailing list discussing what language should be used in recommendations for authoring URIs for use in biomedical datasets and ontologies. It is an attempt to paint a clearer picture of the assumption that the HTTP scheme is a one-size-fits-all URI scheme (especially with respect to the expected behavior of intelligent agents which consume URIs that denote scientific concepts and documents). Relevant links to the recent Technical Architecture Group finding (URNs, Namespaces and Registries) are included for better context.

The Era of Semantic Web Agents

The primary motivation behind this Wiki is that the barrier to the advent of Semantic Web automatons (or "intelligent agents") is threefold, namely the lack of:

  1. A well-defined normalization function which (when applied against legacy data) results in RDF
  2. A robust policy for identifying additional RDF content relevant to an initial set of RDF assertions
    • Can you clarify? Are you referring to the problem of finding assertions about a resource (other than the URI declaration associated with that resource)? -- DBooth

    • I'm referring to the problem of finding assertions both about specific URIs as well as those about other URIs but still of direct relevance (the FOAF graph scenario) -- Chimezie

  3. A robust policy for identifying a set of semantics (typically in the form of an OWL ontology) which determines the mapping from terms in the initial RDF graph to referents in the world.

    • Do you mean the problem of finding the URI declarations for those terms (assuming they use URIs)? -- DBooth

    • I mean the problem which is essentially the reverse of what the RIF WG is currently contemplating: finding a ruleset or entailment regime to apply to an initial dataset to facilitate interpretation of the dataset. This is orthogonal to URI declarations since the mechanism for declaration is model-theoretic (RDF-mt,owl-semantics, and RIF-RDF) -- Chimezie

It is worth noting that with respect to biomedical ontologies, the set of referents of interest are typically not information resources but objects in the real world of scientific interest [Smith, B. "From Concepts to Clinical Reality: An Essay on the Benchmarking of Biomedical Terminologies"]

The class of GRDDL-Aware Agents (as defined by the GRDDL specification) addresses barrier #1. LinkedData is an attempt to address barrier #2 and is similar (in spirit) to OntologicalClosure, which attempts to address barrier #3. However, both run with the assumptions that 1) a vast majority (if not all) of the terms in the initial RDF are resolvable HTTP URIs and 2) exhaustively resolving representations from these HTTP RDF URIs is generally the most useful mechanism for guiding automatons. There are issues regarding what to make of the content returned upon resolving such URIs (TAG's httpRange-14; see also URI Declaration Versus Use) as well as issues with the feasibility of such an assumption. This Wiki is only concerned with the question of feasibility. This assumption seems rooted in the Web Architecture best practice of "Reuse URI schemes":

However, conversely, Web architecture also suggests as a best practice:

HCLS/WebClosureSocialConvention is an attempt to address the second barrier along lines somewhat divorced from LinkedData. The suggestion there is built around the observation that 1) OWL ontologies tend to consolidate the set of assertions needed to address barrier #3 and thus are better-suited as a more guided trail from syntax to semantics and 2) there are at least 3 well-defined RDF relations that hold between RDF terms and information resources with RDF graphs representations (see HCLS/WebClosureSocialConvention#GraphLink): rdfs:isDefinedBy, rdfs:seeAlso, and owl:imports.
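As a rough illustration of the second observation, an agent could restrict itself to the three graph-link relations when deciding what to fetch next, instead of resolving every term it sees. The sketch below is only that: it represents triples as plain Python tuples, and the example.org URIs and the `graph_links` helper are hypothetical names introduced here, not part of HCLS/WebClosureSocialConvention.

```python
# Namespace URIs for the RDFS and OWL vocabularies.
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# The three relations that link RDF terms to information resources
# with RDF graph representations.
GRAPH_LINKS = {RDFS + "isDefinedBy", RDFS + "seeAlso", OWL + "imports"}


def graph_links(triples):
    """Return the objects of every triple whose predicate is a graph link."""
    return {obj for (_subj, pred, obj) in triples if pred in GRAPH_LINKS}


# Hypothetical initial graph (all example.org URIs are made up).
triples = [
    ("http://example.org/onto#SurgicalProcedure", RDFS + "isDefinedBy",
     "http://example.org/onto"),
    ("http://example.org/onto", OWL + "imports",
     "http://example.org/upper"),
    ("http://example.org/onto#SurgicalProcedure", RDFS + "label",
     "Surgical procedure"),
]

# Only the two linked documents are candidates for retrieval; the term
# URI itself is never dereferenced.
to_fetch = graph_links(triples)
```

Under this policy the number of retrievals grows with the number of explicit links rather than with the number of terms in the graph.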

The Usecase

The points are built around the following usecase (not entirely fictional):

The Cost of Assumptions Regarding a Transport Protocol

Jane's concerns about web agents misinterpreting HTTP-schemed URIs as serving doubly as both identifiers *and* locations are discussed in 4.5 Erroneous appearance of dereferencability of identifiers:

In addition, this concern is cited under the FAQ section on www.tagurl.org (Why not use an http URL instead?):

The Issues

Conflating Web Architecture with Semiotics

One of the less-discussed expenses of an exclusive policy of HTTP URIs for RDF assertions is the set of Web Architecture responsibilities (adherence to best practices) that come with use of the HTTP URI scheme. OWL ontology authors are primarily concerned with crafting a well-constrained model theory for the world. This has everything to do with Semiotics (the use of signs to denote things in the world) and model-theoretic semantics (declaration of constraints on the referents denoted by the use of terms in a formal logic) and nothing to do with resolution of information resources, their representations, and other relevant representations (see httpRange-14 resolution).

Signs behave more like windows into reality:

Credit: John F. Sowa

Whereas, resolvable HTTP URLs are more like entries in a File Allocation Table

The task of shaping ontological commitment (see: Role II: A KR is a Set of Ontological Commitments) using RDF and OWL is difficult in its own right without the additional burden of best practices for resolving representations from the URIs used. This is especially the case where a majority of the referents denoted are not information resources.

Furthermore the HTTP 303 response code (and hence all usage of PURLs) is a bit problematic from a Semiotic perspective (see: Discussion). In particular, a 303 (per httpRange-14 interpretation) indicates that the representation returned describes something disjoint from what the initial URI denotes.
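To make the status-code reading concrete, the sketch below encodes the three cases of the TAG's httpRange-14 resolution as a lookup. The function name `nature_of_resource` and the returned strings are hypothetical labels chosen here; the resolution itself only speaks of 2xx, 303, and 4xx responses, so anything else is reported as not covered.

```python
from http import HTTPStatus


def nature_of_resource(status):
    """Classify what a GET response says about the resource a URI identifies,
    following the three cases of the httpRange-14 resolution."""
    if 200 <= status < 300:
        # A 2xx response: the URI identifies an information resource.
        return "information resource"
    if status == HTTPStatus.SEE_OTHER:
        # A 303 response: the URI could identify any resource at all.
        return "any resource"
    if 400 <= status < 500:
        # A 4xx response: the nature of the resource is unknown.
        return "unknown"
    # Other codes are outside the scope of the resolution.
    return "not covered"
```

Note that the 303 case tells an agent nothing positive about the referent; it only withholds the conclusion that the referent is an information resource.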

Surgical Procedure Example

For example: If Jane wants to mint a URI for the physical act of a surgical procedure she can pick a (guaranteed) unique URI easily enough. If she decides to use an HTTP URI it then falls under the jurisdiction (so to speak) of the Architecture of the World Wide Web (AWWW). So, she may have to contend with the following best practices:

.. etc ..

So, she might be compelled to use the following canonical URL to identify the action of a surgical procedure:

Furthermore, she may then be compelled to provide some 'consistent' representation at the web location. Which representation format should she use? Should it primarily be for human consumption (XHTML perhaps) or machine consumption (OWL/RDF)? If it is for human consumption, perhaps her fragment identifiers should be embedded in the representations she serves, etc... Note that so far her considerations have nothing to do with the surgical procedure action which takes place over and over again in hospitals. She hasn't yet begun to consider the constraints that apply universally to surgical procedures and their relations to other things of interest in her domain.

Finally, the fact that the PURL domain responds with HTTP code 303 (by Technical Architecture Group dictate) indicates that whatever is served from the web location to which the PURL domain redirects HTTP request traffic is not a direct representation of the action of a surgical procedure. Whatever is returned is certainly not more authoritative (or informative) for an automaton than an OWL-DL expression which describes the universal constraints of the action of a surgical procedure. For example:

Class: Process
EquivalentTo: :Process that 
              ( :isMainlyCharacterisedBy some :performance ) and
              ( :isEnactmentOf some :SurgicalDeed ) and
              ( :playsClinicalRole only :SurgicalRole )

There is no guidance for how to structure your URIs (although some could be given!)

There are existing best practices and literature for authors who are in the business of minting URIs:

However, these are best practices and do not have the same (grammatical & semantic) rigor typically associated with RFCs for URI schemes (such as tag), each of which has a concrete EBNF for the expected structure of URIs which make use of the scheme as well as clearly defined semantics for their interpretation.

Hash vs. Slash still, in some sense, rages

See: Cool URIs for the Semantic Web which discusses the merits of each approach. In particular: "Choosing between 303 and Hash". The answer is still not definitive, however.

Things like versioning not built in or specced

The most common use of version information embedded within an HTTP URI is in the URIs of W3C specifications.

An example is the Proposed Recommendation URI for GRDDL: http://www.w3.org/TR/2007/PR-grddl-20070716/ . This 'suggests' that the document is a Proposed Recommendation published on July 16th, 2007 (via the PR-grddl-20070716 portion of the path component).

However, this is not explicitly governed by the HTTP URI scheme, which only has the following major syntactic components: authority, path, query, and fragment. The most common scenario with HTTP URIs which embed revision metadata is to embed this 'metadata' in the path portion of the URI.

Certain non-HTTP schemes such as tag and LSID have explicit syntactic components for revision (or time-stamping at the very least). Having the syntax and semantics of version explicitly defined in a URI scheme specification (where having such revision information in the identifier is desirable) removes any ambiguity about how to interpret revision.
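The contrast can be sketched in a few lines. Splitting the GRDDL URI with a generic parser yields only an opaque path, whereas the tag scheme's grammar (RFC 4151) names a date component that a parser can extract by rule. The regular expression below is a simplified approximation of that grammar, and the tag URI used is a hypothetical example, not one from the source.

```python
import re
from urllib.parse import urlsplit

# Generic HTTP URI parsing: the date is buried in the path and has no
# scheme-defined meaning.
parts = urlsplit("http://www.w3.org/TR/2007/PR-grddl-20070716/")
http_path = parts.path  # "/TR/2007/PR-grddl-20070716/"

# Simplified approximation of the RFC 4151 shape:
#   tag:<authority>,<date>:<specific>[#<fragment>]
# where <date> is YYYY, YYYY-MM, or YYYY-MM-DD.
TAG_RE = re.compile(
    r"^tag:(?P<authority>[^,:]+),"
    r"(?P<date>\d{4}(?:-\d{2}(?:-\d{2})?)?):"
    r"(?P<specific>[^#]*)"
    r"(?:#(?P<fragment>.*))?$"
)


def parse_tag(uri):
    """Split a tag URI into its named components, or raise ValueError."""
    match = TAG_RE.match(uri)
    if match is None:
        raise ValueError(f"not a tag URI: {uri!r}")
    return match.groupdict()


# Hypothetical tag URI for illustration only.
tag = parse_tag("tag:example.org,2007-07-16:surgical-procedure")
```

Here `tag["date"]` is available because the scheme says where the date lives; recovering 2007-07-16 from `http_path` would require knowing the W3C's own publishing convention.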

Authoritativeness can be daunting

In Jane's case, she has *no* authority.

Either you have to maintain a server, or find someone who will

Same point as above

One big document or loads of little documents? Either has downsides

If Jane deployed her terminology using resolvable HTTP URIs she can either put all her terms in a single OWL ontology (which doubles as the baseURI of her URI naming convention) or in the logical equivalent (URI re-writing can suffice) of a single file for each term. Jane's scenario suggests she is better served with deploying a single OWL ontology, since this has the least impact on her employer's web server.

Consider getting hammered by unsophisticated spider users

Semantic Web agents which are programmed to expect to be able to fetch representations from each HTTP URI they come across are likely to attempt to resolve each term they come across, putting a severe load on her employer's server. This is especially the case for spiders which don't utilize HTTP caching effectively. The Hash approach resolves some of this concern, but not entirely.
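A small sketch of why the Hash approach helps: fragments are stripped before a request is sent, so all hash-style terms in one namespace map to a single document, while slash-style terms each map to their own. The example.org URIs and the `documents_to_fetch` helper are hypothetical, and the count assumes a well-behaved agent that deduplicates requests, which is exactly what an unsophisticated spider may not do.

```python
from urllib.parse import urldefrag


def documents_to_fetch(term_uris):
    """Return the distinct documents an agent must retrieve to resolve
    every term, given that fragments are never sent to the server."""
    return {urldefrag(uri).url for uri in term_uris}


# 1000 hypothetical terms minted in each style.
hash_terms = [f"http://example.org/onto#Term{i}" for i in range(1000)]
slash_terms = [f"http://example.org/onto/Term{i}" for i in range(1000)]

hash_docs = documents_to_fetch(hash_terms)    # one document
slash_docs = documents_to_fetch(slash_terms)  # one document per term
```

The residual concern is the spider that fetches `http://example.org/onto` once per term anyway, ignoring both deduplication and HTTP caching.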

How do you handle disagreement? Is the URI owner the curator or do they maintain their POV?

If the domain is sold things can fall apart

See: 5.2 Persistent Dereferencability (location independence). In particular:

HTTP, RDF, and Fallacies of Distributed Computing

Also, the assumptions (mentioned at the start) do not fare well when compared with the "Fallacies of DistributedComputing":

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Most of the points of "collision" are well-aligned with the TAG finding (URNs, Namespaces and Registries).

HttpUrisAreExpensive (last edited 2007-09-05 00:32:40 by Chimezie)