March 31, 2004

FAQ: How do I parse RDF?

Application developers often ask how they can get RDF data from the semantic web into their application from the recommended syntax, RDF/XML. This usually ends up being a question about parsers and APIs for particular languages. Mature, standards-compliant open source parsing libraries are widely available for most high-level programming languages that application developers might need. This article summarises which are good, up-to-date choices.

The simple answer is: use one of the readily-available parsers that open source developers in the community have provided. There is no need to create a new parser for the most commonly used application languages, in a similar way to how XML parsers and APIs are widely available.

The more detailed answer depends on the application programming language being chosen (or, for a new project, the answer might influence that choice), as well as the licensing of the project. Most of the items listed below are in easily reusable form and are used in commercial applications. Finally, there is the question of syntax details: does the system support the latest RDF/XML Syntax Specification (Revised) W3C Recommendation of February 2004?

The list below covers what is available and what I recommend when people ask for commonly used languages for web sites.

ARP2 Parser by Jeremy Carroll
A Java parser developed 2001-present (ARP1) and part of the Jena Java semantic web toolkit. Passes all the RDF/XML tests, provides a lot of validation and internationalisation support, and is mature. BSD License with advertising.
Drive by Rahul Singh
A C# parser for the ECMA CLR platform developed over 2002-2003. Passed all the positive RDF/XML tests in 2003, but it is not clear whether it still does. Some third-party reports of problems. GPL license.
RAP RDF API for PHP by Chris Bizer
A PHP library including an RDF/XML parser developed 2002-present. Passes all the RDF/XML tests and is mature. LGPL License.
Raptor by Dave Beckett
A C library developed 2001-present with APIs via Redland in several other web languages: Java, Perl, PHP, Python, Ruby and Tcl. Passes all the RDF/XML tests and is mature. GPL/LGPL/MPL License.
RDF::Simple by Jo Walsh
A Perl parser in CPAN, a translation of rdfxml.py (below), developed recently and not complete. Does not return all the information about literals (language, datatypes) or some details of blank nodes.
RDFLib by Daniel Krech
A Python RDF toolkit including parsers for several RDF syntaxes developed over 2002-present and mature. Passes all the RDF/XML tests. BSD license with advertising.
rdfxml.py by Sean B. Palmer
A Python parser in under 10K of source. Designed to be small and as complete as possible. It passes most of the RDF/XML test suite but has not been updated to handle the later revisions. GPL License / W3C License (alternate version).
Rio RDF Parser by Aduna
A Java RDF/XML parser, part of the Sesame Java toolkit, developed in 2003 as a small and fast parser requiring only SAX2. Passes all the RDF/XML tests. LGPL License.

For the state of the tools that have been run against the RDF/XML tests, see the RDF Core Test Results.

Several of the parsers above also support other RDF syntaxes such as N-Triples (as used by the RDF test cases), Notation 3 (N3), and subsets of N3 and experiments such as Turtle.

There are also several older or unmaintained parsers, and ones whose state against the tests is unknown, that I have no detailed personal knowledge of: Injectilo (XSLT), Profium (Perl and Java, commercial), libwww (C, old), Snail (XSLT, old, slow), RDF Filter (Java, old), Repat (C, old), SWI-Prolog (Prolog), XWMF (Tcl, old) and W3C Perllib (Perl).

Posted by dbeckett2 at 01:44 PM | Comments (0)

March 30, 2004

FAQ: Using RDFS or OWL as a schema language for validating RDF

Many software applications need the ability to test that some input data is complete and correct enough to be processed, e.g. to check the data once so that access functions will not later on break due to missing items. This is commonly done by using a schema language to define what "complete and correct" means in this, syntactic, sense and a schema processor to validate data against the schema.

Developers new to RDF can easily mistake RDFS for a schema language (perhaps because the 'S' stands for schema!). They then get referred to OWL as providing the solution and are surprised by the results of trying to use OWL this way.

This is a big topic which we'll just touch on here. In this FAQ entry I just want to illustrate a few of the pitfalls and hint at why this is harder than it looks, in the hope that it might reduce the "unpleasant surprise" for developers new to OWL.

To spoil the punch line, there isn't yet a really good schema solution for semantic web applications but one is needed. OWL does allow you to express some (though not all) of the constraints you might like. However, to use it you may need an OWL processor which makes additional assumptions relevant to your application - a generic processor will not do the sort of validation a schema-language user is expecting.

The problems arise from fundamental features of the semantic web:
- open world assumption
- no unique name assumption
- multiple typing
- support for inference

Let's look at a few examples of schema-like constraints you might want to express:

1. Required property

Suppose you want to express a constraint something like "every document must have an author". You might say something like:

  eg:Document rdf:type owl:Class;
              rdfs:subClassOf [ a owl:Restriction;
               owl:onProperty     dc:author;
               owl:minCardinality "1"^^xsd:integer ].

  eg:myDoc rdf:type eg:Document . 

You might think that if you asked a general OWL processor to validate this it would say "invalid" because eg:myDoc doesn't have an author. Not so. The OWL restriction says something that is supposed to be "true of the world" rather than true of any given data document. So, on seeing an instance of a Document, an OWL processor will conclude that it must have an author (because every Document does), just not one we know about yet. In fact, if you now ask an OWL-aware processor for the author of myDoc you might, for example, get back a bNode - an example of the inferential, as opposed to constraint-checking, nature of OWL processing. This also fits with the open world assumption: there may be another triple giving an author for myDoc "out there" somewhere.

Of course, the fact that general OWL processors behave this way doesn't prevent one from creating a specialist validator which treats a document as a complete closed description and flags any such missing properties - it is just that a generic OWL reasoner probably won't do this by default.
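
As a rough illustration of such a specialist validator, here is a minimal Python sketch. It is not an OWL processor: it assumes the triples it is given are the complete description of every resource (a closed world), and the function name missing_required and the extra eg:otherDoc data are hypothetical, added only for the example.

```python
# Hypothetical closed-world check: the triples passed in are treated as the
# complete description of each resource, which a generic OWL reasoner
# would not assume.
RDF_TYPE = "rdf:type"


def missing_required(triples, cls, prop):
    """Return the instances of cls that have no value for prop in triples."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    with_prop = {s for s, p, o in triples if p == prop}
    return sorted(instances - with_prop)


data = [
    ("eg:myDoc", "rdf:type", "eg:Document"),
    ("eg:otherDoc", "rdf:type", "eg:Document"),  # illustrative extra document
    ("eg:otherDoc", "dc:author", "eg:Dave"),
]

violations = missing_required(data, "eg:Document", "dc:author")
print(violations)
```

The check flags eg:myDoc only because we chose to read "no author triple here" as "no author", which is exactly the additional assumption a generic reasoner declines to make.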

2. Limiting the number of properties

A related example is expressing the constraint that "every document can have at most one copyright holder".

  eg:Document rdf:type owl:Class;
              rdfs:subClassOf [ a owl:Restriction;
               owl:onProperty     eg:copyrightHolder;
               owl:maxCardinality "1"^^xsd:integer ].

  eg:myDoc rdf:type eg:Document ;
           eg:copyrightHolder eg:institute1 ;
           eg:copyrightHolder eg:institute2 .

Again, if you ask a general OWL processor to validate this set of statements you might expect it to complain that there are two values for eg:copyrightHolder. Not so. In this case, the problem is the lack of a unique name assumption. On the web two different URIs could refer to the same resource and there is no defined way to tell this. Unless there is an explicit declaration that eg:institute1 and eg:institute2 are owl:differentFrom each other, there is no violation.

Indeed, just like in the first example, what an OWL processor does is the reverse. Instead of noticing a violation it infers additional facts which must be true if the data is consistent. In this case it would infer:

       eg:institute1 owl:sameAs  eg:institute2 .

Again, a specialist OWL processor could be told to make an additional unique name assumption to handle such cases but that is not a good thing to do in general. In fact, using such cardinality constraints (e.g. in the guise of owl:InverseFunctionalProperty or owl:FunctionalProperty) to detect aliases is a powerful and much used feature of OWL.
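
The two readings of the same maxCardinality constraint can be put side by side in a small Python sketch. This is only an illustration over plain tuples, not a reasoner; the function names are hypothetical, and the sketch ignores owl:differentFrom declarations entirely.

```python
from itertools import combinations


def values_of(triples, subject, prop):
    """All distinct values of prop for subject, in a stable order."""
    return sorted({o for s, p, o in triples if s == subject and p == prop})


def infer_same_as(triples, subject, prop):
    """Open reading of maxCardinality 1: the values must name the same thing."""
    vals = values_of(triples, subject, prop)
    return [(a, "owl:sameAs", b) for a, b in combinations(vals, 2)]


def violates_max_one(triples, subject, prop):
    """Closed reading with a unique name assumption: two names, two things."""
    return len(values_of(triples, subject, prop)) > 1


data = [
    ("eg:myDoc", "rdf:type", "eg:Document"),
    ("eg:myDoc", "eg:copyrightHolder", "eg:institute1"),
    ("eg:myDoc", "eg:copyrightHolder", "eg:institute2"),
]

inferred = infer_same_as(data, "eg:myDoc", "eg:copyrightHolder")
flagged = violates_max_one(data, "eg:myDoc", "eg:copyrightHolder")
print(inferred, flagged)
```

The first function mirrors what a generic OWL processor concludes; the second is what a schema-language user usually expects, and is only sound if you are prepared to assume that different URIs always denote different resources.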

Life is a little easier if one is dealing with DatatypeProperties because you can tell when two literals are distinct (well even this is hard when you are looking at different xsd number classes but at least strings are easy!).

3. Type constraints

The third common schema requirement is to limit the types of values a given property can take. For example:

  eg:Document rdf:type owl:Class;
              owl:equivalentClass [ a owl:Restriction;
               owl:onProperty     eg:author ;
               owl:allValuesFrom  eg:Person ].

  eg:myDoc rdf:type eg:Document ;
           eg:author eg:Daffy .
  eg:Daffy rdf:type eg:Duck.

  eg:myDoc2 eg:author eg:Dave .
  eg:Dave rdf:type eg:Person .

Does the myDoc example cause a constraint violation? No. In RDF an instance can be a member of many classes. Unless we are explicitly told that the classes eg:Duck and eg:Person are disjoint, all that happens with the myDoc example is that we infer that eg:Daffy must be a Person as well. Again, a specialist processor could be developed to flag a warning in cases where an object is inferred to have a type which is not a known supertype of its declared types; again this would be making additional assumptions not warranted in the general case but useful for input validation purposes.

Having got the hang of the idea that OWL is more about inference than constraint checking, what about myDoc2? Should the OWL processor infer that myDoc2 is a Document? After all, we defined a Document this time using a complete, rather than partial, definition - so that anything for which all authors are Persons should be a document, and the author of myDoc2 is a person. The answer, again, is "no". Just because all the authors we see happen to be people doesn't mean there aren't more authors for myDoc2 that we don't know about.
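
A small Python sketch of both behaviours, under loud assumptions: it works on plain tuples, handles only this one allValuesFrom pattern, and its "specialist" check simply compares inferred types with stated ones rather than consulting a class hierarchy. The function names are hypothetical.

```python
def infer_value_types(triples, cls, prop, value_cls):
    """For each stated instance of cls, every value of prop is a value_cls."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    return sorted({(o, "rdf:type", value_cls)
                   for s, p, o in triples if p == prop and s in instances})


def undeclared(triples, inferred):
    """Specialist check: inferred types that the data itself never states."""
    stated = set(triples)
    return [t for t in inferred if t not in stated]


data = [
    ("eg:myDoc", "rdf:type", "eg:Document"),
    ("eg:myDoc", "eg:author", "eg:Daffy"),
    ("eg:Daffy", "rdf:type", "eg:Duck"),
    ("eg:myDoc2", "eg:author", "eg:Dave"),
    ("eg:Dave", "rdf:type", "eg:Person"),
]

inferred = infer_value_types(data, "eg:Document", "eg:author", "eg:Person")
warnings = undeclared(data, inferred)
print(inferred, warnings)
```

Note what the sketch does not do: it never concludes that eg:myDoc2 is a Document, because seeing only Person authors is not the same as knowing that all of its authors are Persons.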

4. Value ranges

Another common schema requirement is to limit the range of a value. For example to say that an integer representing a day-of-the-month should be between 1 and 31.

Value ranges of this kind (a minimum and a maximum) cannot be expressed in OWL itself.

You can express them within XML Schema Datatypes. You could declare a user defined XSD datatype which is an xsd:integer restricted to the range 1 to 31.

There is a problem in that XML Schema doesn't define a standard way of determining the URI for a user defined datatype, and the RDF datatyping mechanism requires all datatypes to have a URI. This will hopefully get "clarified", and in any case there is a de facto convention which is straightforward, used by DAML and supported by toolkits, so in the meantime we can be non-standard but get work done.

It is also slightly less useful than it seems, since the RDF datatyping machinery requires that each literal value have an explicit datatype URI - you can't just give a lexical value and use range constraints to apply the type.

These caveats aside, the xsd user defined datatype machinery is useful and this is the one place where RDFS on its own, without OWL, can do some validation. An RDFS processor should detect if the lexical form of a typed literal does not match the declared datatype.
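
To make the lexical check concrete, here is a minimal Python sketch of what validating a typed literal against a "day of the month" datatype amounts to. It assumes the datatype is an xsd:integer restricted to 1..31 as described above; the function name is hypothetical and the sketch does not read an actual XML Schema definition.

```python
import re

# Simplified lexical form of xsd:integer: an optional sign and decimal digits.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def valid_day_of_month(lexical):
    """True if lexical is an integer literal whose value lies in 1..31."""
    if not _INTEGER.fullmatch(lexical):
        return False
    return 1 <= int(lexical) <= 31


for form in ["31", "32", "0", "1.5", "ten"]:
    print(form, valid_day_of_month(form))
```

The check only applies once the literal carries the datatype explicitly; a plain "32" with no datatype URI would never reach it.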

5. Complex constraints

The final forms of constraints that come up are ones which involve constraints between values. For example, that a pair of properties should form a unique value pair, or that the value of one datatype property must be less than another property of the same resource, or of a related resource.

No such cross-property constraints can be expressed at all in OWL.
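
In practice such checks end up in application code. The Python sketch below shows one way a "value of one property must be less than another" check might look; the eg:startDay and eg:endDay properties and the event data are invented for the example, and the check again treats the given triples as complete.

```python
def first_value(triples, subject, prop):
    """Return the first value of prop for subject, or None if there is none."""
    for s, p, o in triples:
        if s == subject and p == prop:
            return o
    return None


def less_than_violations(triples, lower_prop, upper_prop):
    """Subjects having both properties where lower is not below upper."""
    subjects = sorted({s for s, p, o in triples if p in (lower_prop, upper_prop)})
    bad = []
    for s in subjects:
        lower = first_value(triples, s, lower_prop)
        upper = first_value(triples, s, upper_prop)
        if lower is not None and upper is not None and not lower < upper:
            bad.append(s)
    return bad


data = [
    ("eg:event1", "eg:startDay", 3),
    ("eg:event1", "eg:endDay", 9),
    ("eg:event2", "eg:startDay", 20),
    ("eg:event2", "eg:endDay", 12),
    ("eg:event3", "eg:startDay", 5),  # no end day: not checked here
]

bad_events = less_than_violations(data, "eg:startDay", "eg:endDay")
print(bad_events)
```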

Posted by dreynold2 at 02:11 PM | Comments (0)

March 26, 2004

Announcing SKOS-Core 1.0 RDF Schema for Thesauri

The SKOS-Core 1.0 schema can be found at

http://www.w3.org/2004/02/skos/core

The SKOS-Core 1.0 Guide accompanying the schema can be found at

http://www.w3.org/2001/sw/Europe/reports/thes/1.0/guide/

Also, the website for the SWAD-Europe Thesaurus Activity has moved to

http://www.w3.org/2001/sw/Europe/reports/thes/

SKOS stands for Simple Knowledge Organisation System. The goal of SKOS-Core is to provide a framework for bringing together the semantic web and existing knowledge organisation systems such as thesauri.

SKOS-Core exploits the features of RDFS and OWL to provide a flexible and extensible framework within which different types of KOS can interoperate. SKOS-Core is ideal for modelling thesauri, and can cope with the variations commonly found in thesaurus design and structure.

Posted by ajmiles at 05:23 PM | Comments (0)

March 23, 2004

WP5 - MathML Use Case

For the RDF, Web Ontology, Logic and Mathematics part of SWAD-Europe's workpackage 5 (relating Semantic Web Technologies to XML technologies), I have written an XSLT transform that evaluates MathML equations on "CompanyReport" files, as provided by danbri. The stylesheet works on a Content MathML file (e.g. mmlrules.mml) containing computation rules, such as rna = (opoa/na)*100:
  <apply>
    <eq/>
    <ci>rna</ci>
    <apply>
      <times/>
      <apply><divide/><ci>opoa</ci><ci>na</ci></apply>
      <cn>100</cn>
    </apply>
  </apply>
The stylesheet retrieves a file (e.g. reports.xml) containing the values of the variables, computes the appropriate results and prints them to standard output: atr = 1.72139269716302 ca = 52798 etc. Currently the stylesheet only supports +, -, / and *. While it's easy to add more templates to support more of MathML, it's less easy to actually perform complex operations. However, using EXSLT might allow coverage of many more operations.
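
For readers who want to see the evaluation step outside XSLT, here is a minimal Python sketch of the same idea. It is not the stylesheet: it assumes un-namespaced Content MathML exactly as shown above, supports only the four operators, and the variable values are made up rather than taken from reports.xml.

```python
import operator
import xml.etree.ElementTree as ET
from functools import reduce

# The four arithmetic operators, keyed by Content MathML element name.
OPS = {"plus": operator.add, "minus": operator.sub,
       "times": operator.mul, "divide": operator.truediv}

RULE = """<apply>
  <eq/>
  <ci>rna</ci>
  <apply>
    <times/>
    <apply><divide/><ci>opoa</ci><ci>na</ci></apply>
    <cn>100</cn>
  </apply>
</apply>"""


def evaluate(node, env):
    """Evaluate a cn, ci or apply element against the variable values in env."""
    if node.tag == "cn":
        return float(node.text)
    if node.tag == "ci":
        return env[node.text.strip()]
    if node.tag == "apply":
        op, *args = list(node)
        values = [evaluate(arg, env) for arg in args]
        if op.tag == "minus" and len(values) == 1:
            return -values[0]  # unary minus
        return reduce(OPS[op.tag], values)
    raise ValueError("unsupported element: " + node.tag)


def apply_rule(rule_xml, env):
    """Evaluate a rule of the form <apply><eq/><ci>name</ci>expr</apply>."""
    eq, target, expr = list(ET.fromstring(rule_xml))
    if eq.tag != "eq":
        raise ValueError("expected an <eq/> rule")
    return target.text.strip(), evaluate(expr, env)


values = {"opoa": 50.0, "na": 200.0}  # made-up figures for illustration
name, result = apply_rule(RULE, values)
print(name, "=", result)
```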
Posted by mf at 06:51 PM | Comments (0)

March 04, 2004

W3C Technical Plenary - Semantic Web Interest Group meet

Several members of the SWAD-Europe team are at the W3C All Groups and Technical Plenary meeting near Cannes this week. Many W3C working groups are meeting face to face, including the new Semantic Web Best Practices working group on Thursday and Friday. I have been attending the Semantic Web Interest Group meeting for the first two days; the Semantic Web Interest Group is the renamed and reconceptualized RDF Interest Group, chaired by Dan Brickley, who is also the director of SWAD-Europe. I summarise here a few of the many interesting topics of discussion over these two days, which were a mixture of discussions, presentations and lightning talks. These are just some of the things that struck me - the meeting was public and detailed logs (day 1, day 2) of presentations are available, plus links to the presentations and documents discussed (day 1, day 2), and I'm sure others will have their own comments to make.

Semantic Web Best Practices and Deployment (SWBPD) Working Group

Guus Schreiber summarised the scope of the new Semantic Web Best Practices working group, which is meeting on Thursday and Friday this week. Before his talk there was a constant refrain of 'that sounds like a job for best practices!' but the charter is reasonably constrained, covering

  • helping people publish existing vocabularies and ontologies which are public, used, royalty free, and are the product of consensus
  • the production of FAQs
  • a list of tools and demos
  • links to other standardization efforts

Most of the work will be focussed on producing W3C Notes; small taskforces with short lifespans will be set up to tackle particular issues and produce the notes. Non-W3C members may be invited to help. A publicly readable mailing list for the working group is available, and a homepage.

These aims are very close to the focus of SWAD-Europe, and it's very pleasing to me personally to see further work at W3C in this direction; the feeling of the meeting also seemed to be very positive. The first meeting is this week (agenda).

DAWG - Data Access Working Group

Eric Prud'hommeaux introduced the new working group on RDF query and data access, for which he will be the W3C staff contact: Dan Connolly is the chair (there will also be a co-chair). Here's the charter, and an excerpt from the scope:

The principal task of the RDF Data Access Working Group is to gather requirements and to define an HTTP and/or SOAP-based protocol for selecting instances of subgraphs from an RDF graph. The group's attention is drawn to the RDF Net API submission. This will involve a language for the query and the use of RDF in some serialization for the returned results.

A crucial first step for the group will be to obtain usecases, testcases and requirements. Particularly important here will be the relationship between this work and XQuery. Rules, update, RDF Schema and OWL semantics, and cursors and proofs are out of scope for the group.

Communication and collaboration tools

There was some discussion of the way in which the members of the interest groups interact. Currently people use the rdfig mailing lists (rdf-interest, rdf-rules, rdf-logic, rdf-calendar) or the #rdfig IRC channel, and these tools provide different kinds of interactions in the community. Email provides more continuity and context via threading; IRC with logging and weblog provides immediacy of interaction and a way to share links. Scheduled IRC chats have been used to talk about calendaring, images and geo, but these can work badly for those with English as a second language, or when people are very distributed around the world (Yoshio, Charles), and while they are useful for making fast, small decisions, a higher-level, architectural view is more difficult in that environment (Dirk). Weblogs, rss (and planet rdf), and the wiki are also very useful; audio/video are other possibilities.

RDF in html and alternative rdf syntaxes

The HTML working group joined us for an hour and presented a possible syntax for RDF in xhtml, probably for xhtml2 (Mark Birbeck). It looks like a very plausible approach. Jeremy Carroll presented work he and Patrick Stickler have been doing - an alternative syntax for RDF called TRIX, processible by XML tools, and including names for graphs. Dan Connolly presented GRDDL, a mechanism for encoding RDF statements in xhtml and XML for extraction by XSLT tools. GRDDL could be used with the html group's proposal to generate RDF.

RDF and images

Closest to my own heart (apart from perhaps calendaring) was the short discussion on RDF and images. Kendall Clark and I both did lightning presentations on this topic, mine covering some of the discussions (weblog entry) we have had in creating and combining vocabularies for image description, and some demos of the various tools for annotating parts of images. Kendall demonstrated the Mindswap java tool for annotating images with arbitrary ontologies. Both of us talked about the need for UI tools that help with what some people have called 'referential integrity' - in this case, being able to search for a person's name, and use the tool to map that to an identifier for the person, without having to type in the identifier by hand, and (preferably) regardless of misspellings. Both my tools and the Mindswap tool use access to remote RDF datasources to do this.

Other talks

Other presentations included Danny Ayers' XOW (winner of 'best slide of the meeting'), semantic blogging 'knobot' by Reto Backman, Jos de Roo on SWIG implementation experience in Euler, a presentation on WSDL and Semantic Web Services from Bijan Parsia, Corese: an RDF engine based on Conceptual Graphs (Olivier Corby), a report on the SWAD-E scalability workshop (Dave Beckett), Annotea: location independent references to resources (José Kahan), Nokia Semantic Web server (Patrick Stickler), Using RDF Datatypes (Graham Klyne), Modelling Context using Named Graphs (Chris Bizer and Jeremy Carroll), Tell me about that URI (Dirk-Willem van Gulik), Dan Connolly and I talking about calendaring, and Danny Weitzner on privacy ('The Transparency Paradox').

Posted by lmiller2 at 12:54 PM | Comments (0)

Thesaurus FAQ Entry: How can I make my thesaurus a part of the Semantic Web?

To make a thesaurus a part of the semantic web, simply

  • encode the thesaurus as RDF using the SKOS schemas,
  • publish the RDF data.
The SKOS schemas are RDF schemas for encoding thesauri and similar types of knowledge organisation system (KOS).

SKOS-Core is the core schema, allowing representation of thesaurus concepts, terms, and organisation of those concepts into hierarchical and associative structures. It has been designed as an extensible framework of properties, and so can be adapted to cope with different types of thesaurus.
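
As a very rough sketch of the "encode" step, the Python below turns a tiny in-memory thesaurus into Turtle. Everything specific here is an assumption to check against the schema itself: the property names skos:prefLabel and skos:broader and the class skos:Concept are taken from the SKOS vocabulary as generally published, the ex: namespace and the dictionary layout are invented, and labels are assumed to contain no quote characters.

```python
# Assumed namespace for SKOS-Core; verify against the published schema.
SKOS = "http://www.w3.org/2004/02/skos/core#"


def to_turtle(concepts, base):
    """Serialise {id: {"label": str, "broader": [ids]}} as Turtle text."""
    lines = ["@prefix skos: <%s> ." % SKOS, "@prefix ex: <%s> ." % base, ""]
    for cid in sorted(concepts):
        entry = concepts[cid]
        lines.append("ex:%s a skos:Concept ;" % cid)
        props = ['    skos:prefLabel "%s"@en' % entry["label"]]
        for parent in entry.get("broader", []):
            props.append("    skos:broader ex:%s" % parent)
        lines.append(" ;\n".join(props) + " .")
        lines.append("")
    return "\n".join(lines)


# A made-up two-concept thesaurus: "cats" has the broader concept "animals".
thesaurus = {
    "animals": {"label": "animals"},
    "cats": {"label": "cats", "broader": ["animals"]},
}

turtle = to_turtle(thesaurus, "http://example.org/thesaurus#")
print(turtle)
```

For real data it would be safer to build the graph with one of the RDF toolkits and let it handle escaping and serialisation.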

The version of SKOS-Core currently available is a pre-release, and a good introduction to using the schema can be found here. A formal release (version 1.0) is planned shortly, along with a guide to using it - watch this space!

SKOS-Mapping is an RDF schema for creating and encoding mappings between thesauri. If mappings between thesauri are available, independent but overlapping thesauri can be used interchangeably, helping to remove the boundaries between collections and communities. A good introduction to SKOS-Mapping with examples is here.

SKOS-Mapping is also currently available as a pre-release version. A formal release can be expected shortly after SKOS-Core 1.0.

There are also a number of reports on issues relating to the use of thesauri on the semantic web, including a review of previous work and a report on multilingual thesauri. The work is ongoing, and discussed on the public-esw-thes@w3.org mailing list (archives) - feel free to join in!

Posted by ajmiles at 09:26 AM | Comments (0)