January 30, 2004

Semantic Blogging Update

The semantic blogging project is officially finished. Code, javadocs, and lessons learnt report are all available. However, some promising semblogging activity continues.

Firstly, the code is being downloaded and played with. Whether this will lead to other, "perhaps even unexpected" uses as I mentioned in the lessons learnt report remains to be seen, but I am hopeful.

Secondly, the bibliographic metadata theme seems to have struck a chord with people like Bruce D'Arcus, who are interested and active in the complex world of bibliographic metadata standards.

Thirdly, the ideas are being picked up by the research community, UK Universities and even a startup (about which, perhaps, more anon). I also have a couple of evaluation projects ongoing within HP to move semblogging from an interesting prototype to a usable tool.

For readers wanting to know more, the best bet is the short vision statement I presented at BlogTalk 2003. Other resources (including code) are available on the download page.

The project maintains its own blog, on which snippets and micro-news continue to be posted. However, regular updates will also be posted on this, the main SWAD-E blog.

Posted by scayzer2 at 04:20 PM | Comments (0)

SWAD-Europe Visits English Heritage

On the 29th of January Nikki Rogers from ILRT and I (Alistair Miles from CCLRC) paid a visit to Edmund Lee and the members of the Data Standards Unit at the English Heritage National Monuments Record Centre. The team at English Heritage have a wealth of experience in thesaurus construction and use. The purpose of the visit was to learn from each other, and to explore how their practical needs relate to the work of the SWAD-Europe Thesaurus Activity.

The visit was a great success, and we were able to break ground on a number of challenging technical issues. There emerged a clear need from the work of English Heritage for distributed access to and development of thesauri, for which at this time there is only a partial and rather heavyweight solution. The development of a thesaurus service, providing access to the functionality of a thesaurus via the internet, would be of real benefit. We hope English Heritage will become involved in our development and implementation of such a service, which is a key part of the SWAD-Europe Thesaurus Activity.

Thanks to Edmund and the team for a warm welcome.

Posted by ajmiles at 02:54 PM | Comments (0)

January 27, 2004

w3photo image annotation work

As part of SWAD-Europe's dissemination efforts, and continuing on the theme of image annotation from the workshop we held in June 2002, we have been collaborating with a number of other groups on a project to annotate photos from the WWW series of conferences. Also involved are Greg Elin, who came up with the idea; members of the Mindswap group from Maryland; members of the IAM group from Southampton; and others including Jim Ley, Morten Frederiksen, Masahide Kanzaki and Benjamin Nowack. There is a mailing list, semantic-photolist@unitboy.com, that anyone can join (currently archived in the www-archive@w3.org archives), and we have held several IRC meetings on the #rdfig channel on freenode.

From the point of view of the SWAD-Europe project, our interests are:

  • discussing and improving vocabularies for annotating parts of an image
  • experimenting with using several different vocabularies together, for example an image vocabulary with a geographical one, a vocabulary describing people, another describing events.

More information about meetings is on the ESW wiki at WWW2004; the specification is at W3PhotoSpec, and the vocabularies used are documented at W3PhotoVocabs.

Annotating parts of an image

There are several tools for annotating parts of an image, for example Jim Ley's SVG tool and RDFPic as demonstrated at the SWAD-E image annotation workshop; also Greg Elin's fotonotes, Masahide Kanzaki's JavaScript image annotator, and a Java tool called Photostuff from the Mindswap group.

Discussions on IRC have centred around the requirements for a vocabulary for describing parts of images, which would include

  • simple mapping to SVG and image maps;
  • extensibility to other forms of media later;
  • consistency with being able to describe what a region depicts (compared to what an entire image depicts);
  • ability to assign URIs to parts of an image;
  • taking into account vocabularies such as Jim Ley's vocab and the Mindswap vocab.

An email from Morten Frederiksen outlines the decisions made and the reasons for those decisions. It is hoped that a home for the RDF/OWL vocabulary can be found on the W3C site. Further discussions are likely to occur on the w3photo list and perhaps on a proposed new list.

Mixing vocabularies for image annotation

Discussions about mixing vocabularies for image annotation have sometimes occurred on the FOAF mailing list as part of the codepiction project and similar efforts by independent programmers. They have often centred around adding geographical information to photos, but also mixing in event information and Dublin Core data, as well as the core 'depicts/depiction' information about who or what is depicted in the photo, using FOAF and Wordnet classes and properties. The w3photo project is also concerned with rights and licensing of the metadata and images for use as sample data for different systems.

Two very interesting issues have come up, at least one of which deserves its own FAQ entry (although at the moment the answer is unclear). The first is:

When you are using terms from another vocabulary, should you use them directly, or create aliases to them using owl:sameClassAs?

This is a very difficult question but I will try to summarise the main points here. Essentially it has to do with technological maturity, processing power, and backwards compatibility. Creating a new OWL ontology with links back to other existing vocabularies means that you get a nice neat description of the vocabulary you want to use, which you can control should the existing vocabularies disappear or redefine their terms. From an OWL point of view, using owl:sameClassAs means there is no difference between using the external classes and properties directly and using the newly defined versions.

However, many tools cannot handle OWL structures, or even RDFS structures, and so where there is substantial existing data using pre-existing vocabularies, creating a whole new one reduces interoperability. Even for tools which can handle OWL, extra processing is required to perform the reasoning to link the new terms to the old ones, and so the result of creating a new vocabulary is slower applications. The ESW wiki has a page for notes on this topic.
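To make the trade-off concrete, here is a minimal sketch in Python, using plain tuples rather than any real RDF toolkit. The class w3p:Photo and the sample data are hypothetical, invented purely for illustration. It shows that a tool with no OWL support misses data typed with an alias, while an alias-aware tool finds it only after an extra expansion step.

```python
RDF_TYPE = "rdf:type"
SAME_CLASS = "owl:sameClassAs"

# Hypothetical sample data: one photo typed with the existing class,
# one typed with a newly minted alias class (w3p:Photo is made up).
data = [
    ("ex:photo1", RDF_TYPE, "foaf:Image"),
    ("ex:photo2", RDF_TYPE, "w3p:Photo"),
]
schema = [("w3p:Photo", SAME_CLASS, "foaf:Image")]


def instances_plain(triples, cls):
    """What a tool with no OWL support sees: exact class matches only."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == cls}


def equivalent_classes(schema_triples, cls):
    """Follow owl:sameClassAs links in both directions until nothing new appears."""
    found, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for s, p, o in schema_triples:
            if p != SAME_CLASS:
                continue
            if s == current:
                other = o
            elif o == current:
                other = s
            else:
                continue
            if other not in found:
                found.add(other)
                todo.append(other)
    return found


def instances_with_aliases(triples, schema_triples, cls):
    """What an alias-aware tool sees, at the cost of the expansion step above."""
    classes = equivalent_classes(schema_triples, cls)
    return {s for s, p, o in triples if p == RDF_TYPE and o in classes}


print(instances_plain(data, "foaf:Image"))
print(instances_with_aliases(data, schema, "foaf:Image"))
```

The second call does strictly more work than the first, which is the interoperability and performance cost described above in miniature.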

Supplementary questions include:

  • For a new schema, should I create two schemas (one RDFS, one OWL) for any given vocabulary, or will one do? What would it look like? Should it be OWL DL, OWL Full or OWL Lite?
  • How can I describe somewhere the links between existing schemas that I'm making if I don't use owl:sameClassAs or reciprocal rdfs:subClassOf?

And relatedly, the second main question:

How can you validate RDF for a particular tool to use?

RDF is not generally concerned with document-level validation, but nevertheless it is very useful to have this level of validation in certain cases. In the w3photo project the aim is to have several different applications which create RDF describing photos, and several different applications which will consume it for display, and document-level validation will be useful for this.

At the document level, decisions need to be made about the minimal set of properties and classes needed so that the consumers can display what is produced: if a consumer has to assume that a typical document has no required parts, it must probe for every property separately, and the effort expended in API calls or queries will be high.

Another reason for using document-level validation is the need for 'referential integrity', or at least the hope that at least some references - to people, events - will be consistent between annotations. If I state that this is a picture of 'Bob Smith', that's not a good identifier for the Bob I'm talking about, even within the limited context of the attendees of the WWW2003 Web Conference. If I state that this picture was taken at the conference with the name 'WWW2003', that might normally disambiguate it, but in some circumstances may not (for example there were two conferences with the acronym 'ISWC' going on at the same time in 2003). Where URIs are not directly used to identify things in the world (like people), identity reasoning depends on the presence of particular properties. In this case, identification of any references to individuals - people or conferences - needs to be at the document level.
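As a rough illustration of identity reasoning that depends on particular properties, the sketch below groups person descriptions that share a value for an identifying property. It assumes foaf:mbox_sha1sum and foaf:homepage are the identifying properties; the node labels and hash values are made up. Names alone are deliberately not used for merging.

```python
def merge_by_identifier(descriptions,
                        identifying=("foaf:mbox_sha1sum", "foaf:homepage")):
    """Group node labels that share a value for any identifying property."""
    parent = {}

    def find(x):
        # Union-find lookup with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    seen = {}  # (property, value) -> first node seen carrying it
    for node, props in descriptions.items():
        find(node)
        for prop in identifying:
            value = props.get(prop)
            if value is None:
                continue
            key = (prop, value)
            if key in seen:
                parent[find(node)] = find(seen[key])
            else:
                seen[key] = node

    groups = {}
    for node in descriptions:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical annotations from three documents.
descriptions = {
    "doc1#p": {"foaf:name": "Bob Smith", "foaf:mbox_sha1sum": "aaa111"},
    "doc2#x": {"foaf:name": "Robert Smith", "foaf:mbox_sha1sum": "aaa111"},
    "doc3#q": {"foaf:name": "Bob Smith"},
}
print(merge_by_identifier(descriptions))
```

The third description shares a name with the first but carries no identifying property, so it stays separate. That is the case document-level validation is meant to catch.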

A final reason is that for the purposes of the w3photo project, licenses are essential, and these need to be present for both the metadata itself (the descriptions of the people for example), as well as for the image. These don't need to be in the same document (because they reference an Image and a metadata document both by their URL) but it is useful in a distributed project like this one to do the checking at the document level, so that the use of any particular server to access the collection of data will produce consistent results.

So, how can you validate RDF documents like this?

Perhaps the simplest way (implemented here) is to use a 'Schemarama'-like system - a set of RDF queries mixed with if-then blocks. For example an Image MUST be present, and MUST have a dc:description. Or a foaf:depicts property MAY be present, and if one is found, it MUST have some identifier for the person depicted, either a hashed version of their email address, or their homepage, or something else.
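The flavour of such rules can be sketched in a few lines of Python. This is not the Schemarama implementation referred to above; it is a toy over plain tuples, and it assumes that foaf:Image is the image class and that foaf:mbox_sha1sum or foaf:homepage counts as an identifier for a depicted person.

```python
RDF_TYPE = "rdf:type"
IDENTIFIERS = ("foaf:mbox_sha1sum", "foaf:homepage")  # assumed identifier properties


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def validate(triples):
    """Return a list of rule violations; an empty list means the document passes."""
    errors = []
    images = [s for s, p, o in triples if p == RDF_TYPE and o == "foaf:Image"]
    if not images:
        errors.append("an Image MUST be present")
    for image in images:
        if not objects(triples, image, "dc:description"):
            errors.append(f"{image} MUST have a dc:description")
        # foaf:depicts MAY be present; if it is, the person needs an identifier.
        for person in objects(triples, image, "foaf:depicts"):
            if not any(objects(triples, person, prop) for prop in IDENTIFIERS):
                errors.append(f"{person} depicted in {image} MUST have an identifier")
    return errors


good = [
    ("img1", RDF_TYPE, "foaf:Image"),
    ("img1", "dc:description", "Keynote session"),
    ("img1", "foaf:depicts", "_:p1"),
    ("_:p1", "foaf:mbox_sha1sum", "aaa111"),
]
bad = [
    ("img1", RDF_TYPE, "foaf:Image"),
    ("img1", "foaf:depicts", "_:p1"),
]
print(validate(good))
print(validate(bad))
```

Each if-block corresponds to one MUST or MAY rule, so producers and consumers can agree on the rule list without agreeing on a reasoner.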

An interesting question is whether it is possible to use OWL to specify these kinds of document-level constraints. We are hoping that this and some of the other questions discussed here can be answered in due course.

Posted by lmiller2 at 03:34 PM | Comments (0)

Thesaurus Activity FAQ

Q: What can thesauri do for the web?

A: Thesauri can enrich the web in several ways.

Thesauri can be used to organise information in a sensible way, which in turn helps us to find what we are looking for on the web. Richer than a simple taxonomy, but simpler than a full blown ontology, thesauri provide a convenient yet powerful way to achieve knowledge organisation. Furthermore, because thesauri have been used for decades by library scientists for the same purpose, there exist a number of extremely well structured, well engineered thesauri in the public domain. Providing the framework for bringing these systems on to the semantic web is a major goal of the SWAD-Europe Thesaurus Activity.

A thesaurus also includes information about terminology, and how different terms may be used to represent different concepts. A thesaurus with rich terminological data can be used to support tasks such as automated classification of documents.

These are two of the ways that thesauri can help significantly reduce the energy barrier that stands before the explosion of the semantic web. By bringing existing knowledge organisation systems into the web, we reduce the effort required in the engineering of ontologies from scratch. And by supporting tasks such as automated document classification, the effort required in generating the metadata that is fundamental to the semantic web is greatly reduced.

Finally, multilingual thesauri provide new opportunities for cross-language interaction via the web.

Posted by ajmiles at 03:10 PM | Comments (0)

January 21, 2004

FAQ entry - rdfs:domain and range

Q. Why do rdfs:domain and rdfs:range seem to work back-to-front when it comes to thinking about the class hierarchy?

A. Because RDFS is a logic-based system. The way rdfs range and domain declarations work is alien to anyone who thinks of RDFS and OWL as being a bit like a type system for a programming language, especially an object oriented language.

To expand on the problem. Suppose we have three classes:
eg:Animal eg:Human eg:Man

And suppose they are linked into the simple class hierarchy:
eg:Man rdfs:subClassOf eg:Human .
eg:Human rdfs:subClassOf eg:Animal .

Now suppose we have property eg:personalName with:
eg:personalName rdfs:domain eg:Human .

The question to ask is this: "can we deduce:
eg:personalName rdfs:domain eg:Man ?"

The answer is "no"; the correct deduction is:
eg:personalName rdfs:domain eg:Animal .

This is completely obvious to anyone who thinks about RDFS as a logic system, but it can be surprising if you are thinking in terms of objects.

A common line of thought is this: "surely [P rdfs:domain C] means roughly that P 'can be applied to' objects of type C, just like a type constraint in a programming language. Now all instances of eg:Man are also eg:Human so we can always apply eg:personalName to eg:Man things, doesn't that mean eg:Man is in the domain of eg:personalName?"

There are two flaws in this line of thought. First, rdfs:domain isn't really a constraint and doesn't mean 'can be applied to'. It means more or less the opposite: it enables an inference rather than imposing a constraint. [P rdfs:domain C] means that if you see a triple [X P foo] then you are licensed to deduce that X must be of type C. So we can see that if we made the illegal deduction [eg:personalName rdfs:domain eg:Man] then everything we applied eg:personalName to would become an eg:Man, and we could no longer have things of type eg:Human which aren't of type eg:Man. Whereas the correct deduction [eg:personalName rdfs:domain eg:Animal] is safe because every eg:Human is an eg:Animal, so the domain deductions don't tell us anything that wasn't already true, so to speak!

The second flaw is in the phrasing "is in the domain of". It is true that eg:Man is, in some sense, "in the domain of" eg:personalName, but the correct translation of this loose phrase is that "eg:Man is a subclass of the domain of eg:personalName", which is quite different from saying "eg:Man *is* the domain of eg:personalName".
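The argument can be checked mechanically with a small Python sketch. It is a toy, not a real RDFS reasoner, and the individual eg:alice is hypothetical. It treats each domain declaration as a licence to infer a type, then compares the types inferred under the declared domain, the safe extra domain (eg:Animal) and the illegal one (eg:Man).

```python
# Class hierarchy from the example, stored as child -> parent.
SUBCLASS = {"eg:Man": "eg:Human", "eg:Human": "eg:Animal"}


def superclasses(cls):
    """The class itself plus everything reachable via rdfs:subClassOf."""
    result = [cls]
    while result[-1] in SUBCLASS:
        result.append(SUBCLASS[result[-1]])
    return result


def inferred_types(triples, domains):
    """Apply [P rdfs:domain C]: each subject of P is a C, and so a member of C's superclasses."""
    types = {}
    for subject, predicate, _ in triples:
        for cls in domains.get(predicate, ()):
            types.setdefault(subject, set()).update(superclasses(cls))
    return types


data = [("eg:alice", "eg:personalName", "Alice")]  # eg:alice is a made-up individual

declared = inferred_types(data, {"eg:personalName": ["eg:Human"]})
safe = inferred_types(data, {"eg:personalName": ["eg:Human", "eg:Animal"]})
unsafe = inferred_types(data, {"eg:personalName": ["eg:Human", "eg:Man"]})

print(declared == safe)                # adding the eg:Animal domain changes nothing
print("eg:Man" in unsafe["eg:alice"])  # adding the eg:Man domain forces a new type
```

Adding eg:Animal as a domain leaves the inferred types unchanged, whereas adding eg:Man turns everything with a personal name into an eg:Man.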

Posted by dreynold2 at 05:28 PM | Comments (2)

Update on semantic portals work

We've started work building a prototype of our semantic portal demonstrator as outlined in [1]. For the first prototype we've got some data from an older UK directory of environment organizations and are developing an appropriate set of ontologies for converting it to RDF.

Like all ontology problems, the task of defining an appropriate ontology for environment organizations just explodes in scale and complexity as soon as you touch it. First we just wanted a broad organizational type but soon found that it would be useful to have a more precise ontology for legal status. A few conversations with a lawyer led to a useable small ontology for legal status, for the UK at least, but it is already too complex to expect many users to want to work with. So we had to go back to a simplified "colloquial" ontology for organizational type with separate links to a more detailed legal status thesaurus. This approach of having a coarse grained ontology which controls the main information structure, with links to more refined thesaurus terms to fill in the details, seems like a useful design pattern and we hope to repeat it for other facets such as organizational activity.

On the hacking front we are putting together a data entry tool based on a customization of our semantic blogging demonstrator, on the grounds that blogging should be a good way to capture data. For the viewing side we are building an aggregator which can merge various semantic blogs and other RDF sources into one repository and provide a faceted browse interface over the repository. We've got an early version of the faceted browse interface going, inspired by projects like Flamenco [2]. It seems a very nice way to browse highly structured RDF data sets and might be worth packaging up as a separate open source tool.
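For readers unfamiliar with faceted browsing, the core idea fits in a short Python sketch. It is not the prototype's code: the records and facet names below are invented, and a real version would query the RDF repository instead of a list of dictionaries.

```python
from collections import Counter


def facet_counts(records, facet):
    """How many records fall under each value of one facet."""
    return Counter(r[facet] for r in records if facet in r)


def refine(records, **selected):
    """Keep only the records matching every selected facet value."""
    return [r for r in records
            if all(r.get(key) == value for key, value in selected.items())]


# Hypothetical organisation records.
records = [
    {"name": "Org A", "type": "charity", "activity": "conservation"},
    {"name": "Org B", "type": "charity", "activity": "education"},
    {"name": "Org C", "type": "company", "activity": "conservation"},
]

print(facet_counts(records, "type"))
charities = refine(records, type="charity")
print(facet_counts(charities, "activity"))
```

The browse interface alternates between the two steps: show the counts for each facet, let the user pick a value, then recompute the counts over the refined set.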


[1] http://www.w3.org/2001/sw/Europe/reports/requirements_demo_2/index.html
[2] http://bailando.sims.berkeley.edu/flamenco.html

Posted by dreynold2 at 03:58 PM | Comments (0)

January 20, 2004

Explaining new features in RDF: the nodeID attribute

This article provides an introduction to the rdf:nodeID attribute which was introduced into the revised RDF/XML syntax by the RDFCore working group. This explanation is intended for RDF and XML developers who have some reasonable familiarity with RDF's XML syntax, and who want to catch up with the new features added to RDF by the RDFCore group. It is not intended as a general introduction to RDF syntax.

Over the last 3 years, W3C's RDF Core working group has been busy modernising the core RDF specifications. In particular, RDFCore have fixed some problems with the original RDF Model and Syntax specification from 1999. One of the most noticeable changes was the introduction into the revised RDF/XML syntax of a new attribute, rdf:nodeID, which is used for encoding descriptions of things without using URI strings to identify them. The intent here is to provide a bit more background on this new feature of RDF's XML syntax.

This article is based on a message written originally in an rdfweb-dev thread. The question was originally about the use of rdf:nodeID within people descriptions using the FOAF RDF vocabulary.

RDF has always contained an rdf:ID attribute, dating from the original RDF spec, first drafted in 1997 when XML itself was in flux, XML namespaces barely existed, and DTDs were state of the art. RDFCore added rdf:nodeID. To understand rdf:nodeID, we should talk a bit about the rdf:ID attribute.

One constraint we had on rdf:ID was that it acted like an XML ID attribute, even though formally we couldn't say it was one since RDF didn't require DTD processing. In particular, you can only have one attribute with any given rdf:ID value within your RDF/XML document. The idea was that these were used for linking to things from elsewhere in the Web. An RDF parser, given some document (that has a base URI) will generate full URIs for a node whose XML element is decorated with an rdf:ID, so for example:

<rdf:Description rdf:ID="me">
 <foaf:name>Dan Brickley</foaf:name>
</rdf:Description>

...if parsed with a base URI of http://example.com/foaf/test1.rdf ...will generate a single triple:

http://example.com/foaf/test1.rdf#me http://xmlns.com/foaf/0.1/name "Dan Brickley"
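The way an rdf:ID value combines with the base URI can be checked with Python's standard library. This is only a sketch of the URI arithmetic, not an RDF parser, and the helper name resolve_rdf_id is made up for this example.

```python
from urllib.parse import urljoin


def resolve_rdf_id(base_uri, rdf_id):
    """rdf:ID="x" names the resource "#x", resolved against the document's base URI."""
    return urljoin(base_uri, "#" + rdf_id)


subject = resolve_rdf_id("http://example.com/foaf/test1.rdf", "me")
print(subject)  # http://example.com/foaf/test1.rdf#me
```

The subject of the generated triple is therefore the document URI plus the fragment '#me', not the bare document URI.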

Simple enough. So where did this rdf:nodeID thing come from?

Well the basic problem was that 'original RDF', ie. the thing begun in 1997 and made into a W3C RECommendation in Feb 1999, was a bit rough around the edges. A few things weren't clear, for example the notion of so-called "anonymous resources", which we now refer to as "blank nodes" (or bNodes) in the graph. These correspond to RDF descriptions of things where the thing (a person, document, company, whatever...) is _mentioned_ yet not _named_ by specifying a full (or even partial) URI. Aside: we stopped using the term "anonymous resource" in acknowledgment that the thing being described isn't intrinsically anonymous; it may well have a widely known URI, it is just that some RDF files can mention it 'in passing' without giving that URI (ie. rdf:about or rdf:resource might not be used).

So... if you have an RDF graph, and it mentions a bunch of things, and several of those things don't have URIs attributed to them in the graph, ie they are blank nodes in the graph, then you can get a problem. It isn't always possible to write down in '99-era RDF/XML the markup that represents (serializes) that data structure losslessly. You end up inventing URIs for things so you can fit them into the constraints of the RDF syntax.

For example, say you have three people. And assume the world is still squabbling away about angels on pinheads and whether people have URIs, so there is no consensus about whether people 'have' URIs. But you still want to describe them in RDF. So you make an RDF graph with bNodes for the people.

Let's label our 3 people 'a','b' and 'c', noting that these are private, local, transitory etc IDs we're using just so we can talk about them. They're not web-wide IDs that we expect to be widely known (such as URIs).

a worksFor b
a marriedTo c
b fatherOf c

This mini-web of relations can't be serialized in '99-era RDF/XML without inventing URIs for these 3 things, ie. the mere syntax of RDF used to force us to mess around with our data, and change (albeit only slightly) what the data was telling us.

<Person>
  <name>person a</name>
  <worksFor>
    <Person>
      <name>person b</name>
      <fatherOf>
         <Person>
           <name>person c</name>
         </Person>
      </fatherOf>
    </Person>
   </worksFor>
   <marriedTo ... -- what do we write here to link to person c! />
</Person>

So this is just the classic problem of mapping between two data structures, ie. directed labeled graphs (RDF) to trees (XML).

We need a way of crossreferencing the XML element that stands for a to the one that stands for c, and labelling that relationship with 'marriedTo'.

So first let me show you the way that RDF'99 would have us do it. Note the asymmetry: we do something different at each end of the link.

<Person>
  <name>person a</name>
  <worksFor>
    <Person>
      <name>person b</name>
      <fatherOf>
         <Person rdf:ID="c">
           <name>person c</name>
         </Person>
      </fatherOf>
    </Person>
   </worksFor>
   <marriedTo rdf:resource="#c"/>
</Person>

Hopefully you can see an analogy with the old style of HTML linking: the link goes 'from' the marriedTo element, 'to' the Person element. This is like an 'a href' pointer to an 'a name' anchor target in HTML.

Since rdf:ID expands to a full URI, and rdf:resource also expands relative URIs to full URIs, what we have here is just shorthand for this:

<Person>
  <name>person a</name>
  <worksFor>
    <Person>
      <name>person b</name>
      <fatherOf>
         <Person rdf:about="http://example.com/whateverthisdociscalled#c">
           <name>person c</name>
         </Person>
      </fatherOf>
    </Person>
   </worksFor>
   <marriedTo rdf:resource="http://example.com/whateverthisdociscalled#c"/>
</Person>

...and the RDF triples you get back from an RDF parser reflect this. The descriptions of 'a' and 'b' will generate blank nodes in the graph, ie. we get RDF which says, in effect,

"there is a thing and we don't know its URI but anyway that thing has a worksFor relationship to another thing that we don't know the URI for either but that thing has a fatherOf relationship to yet another thing which is named by the URI http://example.com/whateverthisdociscalled#c and the first thing has a marriedTo relationship to the thing whose URI is http://example.com/whateverthisdociscalled#c". (I've omitted the 'name' properties here for brevity).

So, where does that leave us?

Well, firstly we have the problem of the RDF syntax forcing us to invent URIs just so we can use RDF's XML syntax. Worse, the URIs that get invented / assigned confuse the thing being identified with (part of) an RDF description that happens to mention that thing. Person 'c' from the silly story above would probably be surprised to discover that the Web community were treating http://example.com/whateverthisdociscalled#c as if it were a well known identifier for him/her.

The RDF Core Working Group created rdf:nodeID as a cleanup for this situation, so that now almost all RDF graphs can be round-tripped through RDF parsers/APIs and back into RDF/XML syntax without that process having to make up silly URIs for things in this way.

<Person>
  <name>person a</name>
  <worksFor>
    <Person>
      <name>person b</name>
      <fatherOf>
         <Person rdf:nodeID="c">
           <name>person c</name>
         </Person>
      </fatherOf>
    </Person>
   </worksFor>
   <marriedTo rdf:nodeID="c"/>
</Person>

...is the new look version. It is symmetrical, since the HTML-derived linking metaphor didn't really work in RDF. Neither XML element is really 'linking to' the other in the sense familiar from hypertext. It is more that they are about the same resource. But RDF's syntax already uses the attribute names 'about' and 'resource', so we made up another somewhat technical attribute name, 'nodeID', whose purpose is to take local-to-this-document identifiers for the things described by our XML elements. Unlike the rdf:about and rdf:resource design, we don't distinguish between the cases where the element stands for a node in the graph (ie. rdf:about) versus stands for an edge in the graph (ie. rdf:resource). This is another belated lesson: the original RDF syntax could have been much simpler, by just replacing both rdf:about and rdf:resource with a single attribute called 'rdf:URI' or somesuch.
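To see the symmetry in action, here is a deliberately tiny Python reader for the subset of RDF/XML used in the three-person example. It is a sketch, not a conforming RDF parser: it handles only nested node elements, literal properties and rdf:nodeID, it skips the rdf:type triples for Person, and it assumes a made-up namespace (http://example.com/ns#) for the element names.

```python
import xml.etree.ElementTree as ET
from itertools import count

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.com/ns#"  # hypothetical namespace for Person, name, worksFor...
NODE_ID = f"{{{RDF}}}nodeID"   # how ElementTree spells the rdf:nodeID attribute

DOC = f"""
<Person xmlns="{EX}" xmlns:rdf="{RDF}">
  <name>person a</name>
  <worksFor>
    <Person>
      <name>person b</name>
      <fatherOf>
        <Person rdf:nodeID="c">
          <name>person c</name>
        </Person>
      </fatherOf>
    </Person>
  </worksFor>
  <marriedTo rdf:nodeID="c"/>
</Person>
"""


def parse(xml_text):
    """Return (subject, property, object) triples for the subset described above."""
    fresh = count(1)
    triples = []

    def local(tag):
        # Strip the "{namespace}" prefix ElementTree puts on tag names.
        return tag.split("}", 1)[1]

    def node(elem):
        # A node element: reuse its rdf:nodeID label or mint a fresh blank node label.
        node_id = elem.get(NODE_ID)
        label = "_:" + (node_id if node_id else f"genid{next(fresh)}")
        for prop in elem:
            ref = prop.get(NODE_ID)
            if ref:            # property pointing at a labelled blank node
                triples.append((label, local(prop.tag), "_:" + ref))
            elif len(prop):    # property containing a nested node element
                triples.append((label, local(prop.tag), node(prop[0])))
            else:              # property with a plain literal value
                triples.append((label, local(prop.tag), prop.text))
        return label

    node(ET.fromstring(xml_text))
    return triples


for triple in parse(DOC):
    print(triple)
```

Both the fatherOf and the marriedTo triples end at the same blank node label, _:c, and no URI has been invented for any of the three people.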

But that's water under the bridge. RDF Core fixed the nodeID situation because it was affecting people's ability to write sensible RDF without having XML encoding artifacts interfere with what they were saying. We could have done a bunch more to beautify the syntax, but at cost of greater disruption.

Although rdf:nodeID is relatively new, it is being supported by more and more RDF parsers and serializers, and is relatively easy to add to an RDF toolkit. If you are using an RDF parser or serializer which doesn't support rdf:nodeID, let them know that it's time to upgrade, and perhaps ask here or on www-rdf-interest to see if someone might help with the necessary fixes.

Posted by danbri at 01:00 PM | Comments (0) | TrackBack

January 14, 2004

Report on Semantic Web Storage and Retrieval Workshop

SWAD-Europe Deliverable 3.11: Report on the Semantic Web Storage and Retrieval Workshop, which was held on 13-14 November in Amsterdam and hosted by the Vrije Universiteit.

Posted by dbeckett2 at 10:06 AM | Comments (0)