HCLSIG BioRDF Subgroup/Tasks/URI Best Practices/Papers/SWLS

Issues:

The "LSID" name:

Are life science identifiers different enough that they need to be treated separately? Do we then need a physical science identifier, a computer science identifier, etc.?

LSID as a protocol as well as a name:

Similar issue, but one that can also be described as death-by-plugins - if everyone who wants to control a namespace for identifiers makes a new protocol requiring a plug-in...

Persistence policy as part of the name/protocol:

Is persistence such a unique and overriding piece of metadata that it should be part of the name and/or require a separate protocol? Does the name of data change when a researcher decides it is valid and should be kept forever? There seem to be problems analogous to the 'don't encode location in the name because it might move' issue.

Persistence policy as a binary option:

There are many shades of grey in persistence: How long is the guarantee? What happens to data with a 5, 10, or 50 year retention schedule, after which it is to be deleted? Is access also guaranteed or just unique naming? Is the guarantee best effort? Does it apply to bits or an ‘equivalent’ (by whose definition) item, e.g. the PDF copy of an obsolete MS Word 1.0 document? Is persistence policy handled better as metadata defined by a schema(s)?

Metadata retrieval as part of a persistent identifier protocol:

Is metadata unique to persistent resources? Is there a reason to balkanize metadata access by tying the mechanism to a type of resource? Or should the semantic web provide a mechanism allowing metadata association with ‘any’ resource, persistent or not, via a standard mechanism?

General Commentary:

1) A model for naming resources that a community can agree on is a good / powerful thing; LSID has defined such a model and has a large growing community behind it.

Yes, but… the issues above could limit growth and lead to fragmentation of the community as it raises awareness of what globally unique IDs can do and encourages other “my community’s ID” protocols, and/or modifications that attempt to get around the issues noted above. Will chemists all adopt LSID simply because some of the molecules they work on are related to biology rather than materials science? Will a pharmaceutical company adopt LSID for data with retention schedules?

2) Persistent identification and the ability to persistently resolve names are not artifacts of any technology – they are an organization / community investment. It is unclear what investment the LS community has made at this point in supporting resolution services (DNS, HTTP, or other).

Shouldn't expectations of persistence be managed by naming convention rather than protocol, e.g. http://persistent.my.org/ addresses or the use of Handle-style/meaning-free URLs (e.g. http://456.10123.name.org/myname - see below)? The convention of "www.*" for web servers seems to have worked very well for conveying the expectation that these machines support HTTP.

3) The non-HTTP URI approach requires an extra level of infrastructure for resolving objects. For use in browsers this requires an additional plug-in. There seem to be very few available, and then only for certain browsers. Further, I don't think many realize that browsers are perhaps 1/10th of the applications that follow links (robots, etc.). That is a different issue completely, and one the DOI community / publishers are unfortunately finding out about at this very moment.

A Handle-style proxy mechanism helps a bit here, but it is certainly not as clean/clear as specifying HTTP redirect as *the* resolution mechanism.

4) Non-HTTP URIs put up barriers to adoption by other communities. There are reasons (sometimes) to do this, but has this been explored for LSID and are the implications understood?

And since science is becoming more interdisciplinary, the protocol really needs to be science-wide or pervasive even if namespaces are controlled by smaller orgs.

5) The LSID community has socially agreed that an LSID will point to an immutable resource - the thing one points at will be the same 5, 10, n years later. How can this be enforced socially or technically? What’s the penalty for reusing an LSID? If the LSID, the bits to persist, and the hash are all owned by one organization, the bits and hash could be changed together.

This requirement is science-wide - it's been the argument against allowing any URLs as references in the literature, and everyone is moving to treat data in the same way. Life science is ahead in the number of individual data items to be tracked and in how large the community is that needs to persistently refer to things, hence it has the biggest problem right now, but everyone in science (and beyond) has it at some level. Socially, it isn’t clear that LSID provides any more leverage than, for example, a naming convention as in #2. Technically, without a means to make name/hash pairs non-repudiable (e.g. by registering them with a neutral third party or using a digital signature), LSID cannot detect reuse of names.
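The third-party registration idea can be sketched in a few lines. This is only an illustration under stated assumptions: the class name, the choice of SHA-256, and the example identifier are all hypothetical and are not part of the LSID specification.

```python
import hashlib


class NameHashRegistry:
    """Hypothetical neutral third-party registry of (name, content hash) pairs."""

    def __init__(self):
        # identifier -> hex digest recorded at first registration
        self._hashes = {}

    def register(self, name: str, data: bytes) -> str:
        """Record the digest for a name; refuse a second, different digest."""
        digest = hashlib.sha256(data).hexdigest()
        recorded = self._hashes.setdefault(name, digest)
        if recorded != digest:
            raise ValueError(f"{name} is already bound to different bits")
        return digest

    def verify(self, name: str, data: bytes) -> bool:
        """True only if the name is registered and the bits match the record."""
        return self._hashes.get(name) == hashlib.sha256(data).hexdigest()


registry = NameHashRegistry()
lsid = "urn:lsid:example.org:sequences:12345"  # made-up identifier
registry.register(lsid, b"ACGTACGT")
print(registry.verify(lsid, b"ACGTACGT"))  # unchanged bits
print(registry.verify(lsid, b"ACGTTTTT"))  # bits altered under the same name
```

Because the registry is held by someone other than the data owner, the owner can no longer change the bits and the hash together without the mismatch being visible.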

6) It is unclear how best to use LSID; more specifically *when* to use it and when *not* to. There was talk at the meeting of using these for documents, reports, concepts declared on the Semantic Web, etc.

There's a slippery slope here and it will be hard to have a clear convention. I may want to name my raw data, the average of my raw data, a calibrated version of my data, my latest/best data, a graph of my data, the paper about the data, etc. From various discussions of versioning, it is clear that some use cases need to name/expose both the individual versions and the 'latest' version, whatever number that currently is. Bit-level persistence will therefore probably not meet all life-science needs, which may lead to 'abuse' of LSIDs with 0-byte data to refer to things that change over time.

7) Is LSID bad?

No. The level of adoption of LSID is impressive (though it isn't clear how much of that is simply attaching LSIDs for future use versus actively producing and consuming them). While the discussions at the Semantic Web for Life Sciences workshop were negative at times, one should not criticize LSIDs without acknowledging that they are a step forward and are definitely enabling and educating the community. However, the semantic web and the life sciences will need more general mechanisms for naming and associating metadata with resources, and a means to provide more detailed persistence information; promoting LSIDs as a short-term solution may not be the best option if progress on these issues can be made quickly.

Potential Alternatives:

Naming:

The Handle System – similar to LSID with its own protocol and resolution mechanism. Used in DOIs. Has a proxy mechanism so no plug-in is required - http://hdl.handle.net/<some-handle> will invoke a resolver service and redirect you to the resource. The Handle System has its own protocol with its own metadata methods and thus shares those issues with LSIDs. However, its proxy, and the fact that the protocol and namespaces are separate (i.e. the LSID community could organize part of handle space for themselves), seem like advantages over LSID. Handles are also being proposed as part of the Grid naming mechanism (see http://www.globusworld.org/program/abstract.php?id=33, https://forge.gridforum.org/projects/ogsa-wg/document/draft-charter-naming-wg/en ).
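As a small illustration of why the proxy removes the need for a plug-in, the sketch below turns a handle into an ordinary HTTP URL that any link-following application can use. The function name and the example handle are made up for illustration; only the proxy address comes from the text above.

```python
from urllib.parse import quote

HANDLE_PROXY = "http://hdl.handle.net/"  # proxy address given in the text


def handle_to_proxy_url(handle: str) -> str:
    """Turn a handle of the form 'prefix/suffix' into a plain HTTP URL."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError("expected a handle of the form prefix/suffix")
    # Percent-encode characters that are unsafe in a URL path;
    # any further '/' inside the suffix is kept as-is.
    return HANDLE_PROXY + quote(prefix, safe="") + "/" + quote(suffix, safe="/")


# Made-up handle, used only to show the shape of the result.
print(handle_to_proxy_url("10.9999/example-item"))
```

A client then simply issues an HTTP GET on the resulting URL and follows the redirect returned by the proxy.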

Persistent URLs – standard URLs maintained by authorities that use HTTP Redirect to provide access to resources. The PURL website has extensive documentation and FAQ information: http://purl.oclc.org

Naming convention only - Use standard URLs and DNS resolution. Resolvers/authorities could be identified via a convention such as addresses starting with “uid”, e.g. http://uid.my.org/. If URIs used as persistent names are “meaning-free” addresses, e.g. http://456.10123.name.org/myresourcename, it would be easy to transfer resolution duties between organizations, i.e. to reassign 10123.name.org from my organization to yours if my org doesn’t want to maintain things anymore. Redirects would serve as the resolution mechanism.
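A minimal sketch of the redirect idea follows. It assumes a resolver that keeps a lookup table from meaning-free names to current locations; the table contents, the target addresses, and the use of status 302 are illustrative assumptions rather than part of any proposal cited here.

```python
from urllib.parse import urlsplit

# Hypothetical resolver table: meaning-free path -> current location.
LOCATIONS = {
    "/myresourcename": "http://data.my.org/archive/myresourcename.rdf",
}


def resolve(uri, table):
    """Return (status, location) the way an HTTP redirect resolver might."""
    path = urlsplit(uri).path
    target = table.get(path)
    if target is None:
        return 404, None  # unknown name
    return 302, target    # redirect the client to the current location


name = "http://456.10123.name.org/myresourcename"
print(resolve(name, LOCATIONS))

# If the data moves, only the table changes; the persistent name stays the same.
LOCATIONS["/myresourcename"] = "http://data.your.org/store/myresourcename.rdf"
print(resolve(name, LOCATIONS))
```

Reassigning 10123.name.org to another organization would amount to handing over this table along with the DNS entry.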

Metadata:

Protocols such as LSID and The Handle System have their own extensible metadata mechanisms. For URL-based options, there are proposals for ways to add metadata capabilities to URLs:

The Nokia MPUT/MGET/MDELETE methods proposed as part of their URI Query Agent Model (URIQA) (http://sw.nokia.com/uriqa/URIQA.html). URIQA defines the concept of a Concise Bounded Description of a resource as the set of RDF statements accessible via these methods.

Clark et al. propose an alternate mechanism using XPointer and HTTP in “A Semantic Web Resource Protocol: XPointer and HTTP” (http://www.mindswap.org/papers/swrp-iswc04.pdf).

Persistence Policy:

With any of these naming and metadata combinations, persistence could be treated in the same way as other metadata – statements about persistence policy could be standardized and accessed via the same mechanism used to discover authors, type, creation date, etc.
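To make this concrete, here is one possible shape for such statements, sketched as a plain record. Every field name below is hypothetical; no standardized persistence vocabulary is implied, and the fields simply mirror the questions raised under "Persistence policy as a binary option" above.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class PersistencePolicy:
    """Hypothetical persistence statements; field names are illustrative only."""
    retention_years: Optional[int]  # None means "keep indefinitely"
    access_guaranteed: bool         # is retrieval promised, or only unique naming?
    best_effort: bool               # is the guarantee only best effort?
    bit_level: bool                 # exact bits, or an 'equivalent' rendition?


def describe(uri, author, policy):
    """Return one metadata record in which persistence sits beside other fields."""
    record = {"uri": uri, "author": author}
    record.update({f"persistence.{k}": v for k, v in asdict(policy).items()})
    return record


policy = PersistencePolicy(retention_years=10, access_guaranteed=True,
                           best_effort=True, bit_level=False)
record = describe("http://uid.my.org/myresourcename", "A. Researcher", policy)
for key, value in record.items():
    print(key, "=", value)
```

A consumer would then discover the retention period through the same lookup it uses for the author, with no special protocol for persistent resources.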

Additional URLs:

Handles: www.handle.net

Tim B-L musings on names from '96: http://www.w3.org/DesignIssues/NameMyth.html

Meaning-free DNS names: http://www.frankston.com/public/essays/DNSSafeHaven.asp

Comparison of Handles and PURLs (by a Handle advocate?): http://web.mit.edu/handle/www/purl-eval.html

LSID spec: http://www.omg.org/docs/dtc/04-05-01.pdf

“Persistent Indentification (sic): A Key Component of an E-Government Infrastructure, Updated July 26, 2004” – discusses PURLS and Handles and other alternatives: http://cendi.dtic.mil/publications/04-2persist_id.html