April 14, 2004

FAQ: How do I validate RDF?

Validation for RDF can mean a variety of different things, especially where RDF is written in XML and several layers of technology are connected. This FAQ describes what validation means for RDF and how to do it for the different technologies.

Categories Frequently Asked Questions (FAQs)

Validation is a tricky word to consider, and often used with schema, which can also have several different interpretations. There is validation of syntax (XML validation, RDF/XML - RDF's XML syntax) as well as RDF schema validation.

That means you can do:

  1. XML validation against an XML schema, also called XML schema validation
  2. RDF/XML validation, checking that the syntax matches the RDF/XML Syntax Specification (Revised) W3C Recommendation
  3. RDF schema validation

RDF schema validation bears expanding. An RDF schema is a description of the terms used in the RDF triples forming the RDF graph (which can be written in a document, in an RDF/XML syntax). Checking the terms match how the RDF schema describes them is what RDF schema validation typically means. RDF schemas allow description of classes, properties and the ranges, domains of properties and so on. This is explained further in the RDF Vocabulary Description Language 1.0: RDF Schema W3C Recommendation.
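As a rough illustration of the term checking described above, the sketch below flags classes and properties that a set of triples uses but that a schema does not describe. The triples-as-tuples representation, the `undeclared_terms` helper and the `eg:` names are all assumptions made for illustration; real toolkits work on parsed graphs and cover far more of RDFS.

```python
RDF_TYPE = "rdf:type"
RDFS_CLASS = "rdfs:Class"
RDF_PROPERTY = "rdf:Property"

def undeclared_terms(schema, data):
    """Return the classes and properties used in data but not described in schema."""
    classes = {s for s, p, o in schema if p == RDF_TYPE and o == RDFS_CLASS}
    properties = {s for s, p, o in schema if p == RDF_TYPE and o == RDF_PROPERTY}
    missing = set()
    for s, p, o in data:
        if p == RDF_TYPE:
            # The object of rdf:type should be a class the schema describes.
            if o not in classes:
                missing.add(o)
        elif p not in properties:
            missing.add(p)
    return missing

# Hypothetical schema and data.
schema = [
    ("eg:Document", RDF_TYPE, RDFS_CLASS),
    ("eg:title", RDF_TYPE, RDF_PROPERTY),
]
data = [
    ("eg:myDoc", RDF_TYPE, "eg:Document"),
    ("eg:myDoc", "eg:title", "RDF validation"),
    ("eg:myDoc", "eg:titel", "A misspelt property"),
]
print(undeclared_terms(schema, data))  # {'eg:titel'}
```

This is a lint-style check for misspelt or unknown terms. It is not RDFS entailment, which draws conclusions from the schema rather than rejecting data.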

There are plenty of tools that do these kinds of things already. XML validation needs an XML parser, a description of the XML (an XML schema in some XML schema language) and an XML schema validator. RDF/XML validation needs an RDF parser (see the FAQ How Do I Parse RDF?). RDF schema validation needs a system that does the checking, which is formally called handling RDFS entailment and is defined in the RDF Semantics W3C Recommendation.
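Below the layers listed here sits plain XML well-formedness, which any XML parser can check before a document is handed to an RDF/XML parser. The sketch uses Python's standard `xml.etree.ElementTree`; the `check_xml_layer` helper is a name made up for this example, and it is in no way a substitute for an RDF/XML validator or an XML schema validator.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def check_xml_layer(text):
    """Return (well_formed, has_rdf_root) for a string of XML."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return (False, False)
    # ElementTree writes qualified names as {namespace}localname.
    return (True, root.tag == "{%s}RDF" % RDF_NS)

good = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<rdf:Description rdf:about="http://example.org/doc"/>'
    '</rdf:RDF>'
)
print(check_xml_layer(good))
print(check_xml_layer("<rdf:RDF>"))
```

A document can pass this check and still be invalid RDF/XML, and a valid RDF/XML document need not have rdf:RDF as its root element, so the second value is only a hint.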

You can find RDF schema validators in free software RDF toolkits such as Jena, Sesame and Euler (all in Java) as well as in the Python Cwm. There are some on-line RDF schema validators such as Rosco - a non-judgemental RDF schema and document checker which does checking, but not all RDFS entailments.

OWL systems which contain OWL reasoners can also typically validate RDF schema since it is a small subset of the more powerful things that OWL reasoners can do, and all RDF is OWL Full. OWL applications that only handle OWL-DL or OWL-Lite cannot check all RDFS entailments.

The W3C RDF validator is an RDF/XML validator, not an RDFS validator. It is based on the ARP2 RDF parser which is part of Jena. There are many more RDF parsers available that can perform this check, as already described in the FAQ How Do I Parse RDF?

Finally, you can do XML validation of RDF/XML against an XML schema such as RELAX NG. It's in the RDF/XML specification in section A.1 RELAX NG Compact Schema. The W3C XML Schema language (WXS) is not as suitable as RELAX NG for XML-validating RDF/XML, since it enforces strong XML constraints and RDF/XML has a wide-open set of tags that may appear.

Posted by dbeckett2 at 11:45 AM | Comments (0)

March 31, 2004

FAQ: How do I parse RDF?

Application developers often ask how they can get RDF data from the semantic web into their application from the recommended syntax, RDF/XML. This usually ends up being a question about parsing syntaxes and APIs in certain languages. There are widely available, mature and standards-compliant open source parsing libraries for most high level programming languages that application developers might need. This article provides a summary of which ones are good, up-to-date choices.

Categories Frequently Asked Questions (FAQs)

The simple answer is: use one of the readily-available parsers that open source developers in the community have provided. There is no need to create a new parser for the most commonly used application languages, in a similar way to how XML parsers and APIs are widely available.

The more detailed answer depends on the application programming language being chosen (or if a new project, this might influence that choice), as well as the licensing of the project. Most of the items listed below are in easily reusable form and are used in commercial applications. Finally, there is a question of the syntax details: does the system support the latest RDF/XML Syntax Specification (Revised) W3C Recommendation of February 2004?

The list below covers what is available and what I recommend when people ask for commonly used languages for web sites.

ARP2 Parser by Jeremy Carroll
A Java parser developed 2001-present (ARP1) and part of the Jena Java semantic web toolkit. Passes all the RDF/XML tests, provides a lot of validation and internationalisation support and is mature. BSD License with advertising.
Drive by Rahul Singh
A C# parser for the ECMA CLR platform developed over 2002-2003. Passed all the positive RDF/XML tests in 2003 but it is not clear whether it does at present. Some third-party reports of problems. GPL license.
RAP RDF API for PHP by Chris Bizer
A PHP library including an RDF/XML parser developed 2002-present. Passes all the RDF/XML tests and mature. LGPL License.
Raptor by Dave Beckett
A C library developed 2001-present with APIs via Redland in several other web languages: Java, Perl, PHP, Python, Ruby and Tcl. Passes all RDF/XML tests and mature. GPL/LGPL/MPL License.
RDF::Simple by Jo Walsh
A Perl parser in CPAN, a translation of rdfxml.py (below), developed recently and not complete. Does not return all the information about literals (language, datatypes) or some details of blank nodes.
RDFLib by Daniel Krech
A Python RDF toolkit including parsers for several RDF syntaxes developed over 2002-present and mature. Passes all the RDF/XML tests. BSD license with advertising.
rdfxml.py by Sean B. Palmer
A Python parser in under 10K of source. Designed to be small and as complete as possible. It passes most of the RDF/XML test suite but has not been updated to do the later revisions. GPL License / W3C License (alternate version).
Rio RDF Parser by Aduna
A Java RDF/XML parser part of the Sesame Java toolkit developed in 2003 as a small and fast parser requiring only SAX2. Passes all the RDF/XML tests. LGPL License.

For the state of the tools that have been run against the RDF/XML tests, see the RDF Core Test Results.

Several of the parsers above also provide support for other RDF syntaxes such as N-Triples, as used by the RDF test cases, Notation 3 (N3) and other subsets of N3 and experiments such as Turtle.
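To give a feel for why N-Triples is popular for test cases, here is a minimal sketch of a line-based reader in Python using only the standard library. It is an illustration under stated assumptions, not a conforming parser: the `parse_ntriples` helper is made up for this example, it returns terms as raw strings, does no unescaping and does not enforce every detail of the grammar. For real work, use one of the parsers listed above.

```python
import re

URI = r'<[^>]*>'
BNODE = r'_:[A-Za-z][A-Za-z0-9]*'
# A quoted string, optionally followed by a language tag or a datatype URI.
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[a-z]+(?:-[a-z0-9]+)*|\^\^<[^>]*>)?'
LINE = re.compile(
    r'^\s*(%s|%s)\s+(%s)\s+(%s|%s|%s)\s*\.\s*$'
    % (URI, BNODE, URI, URI, BNODE, LITERAL)
)

def parse_ntriples(text):
    """Return a list of (subject, predicate, object) strings, one per triple line."""
    triples = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Blank lines and comment lines carry no triples.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError("line %d is not a valid triple: %r" % (number, line))
        triples.append(match.groups())
    return triples

sample = '''
# a comment
<http://example.org/doc> <http://purl.org/dc/elements/1.1/title> "Parsing RDF"@en .
<http://example.org/doc> <http://purl.org/dc/elements/1.1/creator> _:a1 .
'''
for triple in parse_ntriples(sample):
    print(triple)
```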

There are also several other tools that are older, unmaintained or in an unknown state against the tests, and that I have no detailed personal knowledge of: Injectilo (XSLT), Profium (Perl and Java, commercial), libwww (C, old), Snail (XSLT, old, slow), RDF Filter (Java, old), Repat (C, old), SWI-Prolog (Prolog), XWMF (Tcl, old), W3C Perllib (Perl).

Posted by dbeckett2 at 01:44 PM | Comments (0)

March 30, 2004

FAQ: Using RDFS or OWL as a schema language for validating RDF

Many software applications need the ability to test that some input data is complete and correct enough to be processed, e.g. to check the data once so that access functions will not later on break due to missing items. This is commonly done by using a schema language to define what "complete and correct" means in this, syntactic, sense and a schema processor to validate data against the schema.

Developers new to RDF can easily mistake RDFS for a schema language (perhaps because the 'S' stands for schema!). They then get referred to OWL as providing the solution, and are surprised by the results of trying to use OWL this way.

This is a big topic which we'll just touch on here. In this FAQ entry I just want to illustrate a few of the pitfalls and hint at why this is harder than it looks, in the hope that it might reduce the "unpleasant surprise" for developers new to OWL.

Categories Frequently Asked Questions (FAQs)

To spoil the punch line, there isn't yet a really good schema solution for semantic web applications but one is needed. OWL does allow you to express some (though not all) of the constraints you might like. However, to use it you may need an OWL processor which makes additional assumptions relevant to your application - a generic processor will not do the sort of validation a schema-language user is expecting.

The problems arise from fundamental features of the semantic web:
- open world assumption
- no unique name assumption
- multiple typing
- support for inference

Let's look at a few examples of schema-like constraints you might want to express:

1. Required property

Suppose you want to express a constraint something like "every document must have an author". You might say something like:

  eg:Document rdf:type owl:Class;
              rdfs:subClassOf [ a owl:Restriction;
               owl:onProperty     dc:author;
               owl:minCardinality "1"^^xsd:integer].

  eg:myDoc rdf:type eg:Document . 

You might think that if you asked a general OWL processor to validate this it would say "invalid" because eg:myDoc doesn't have an author. Not so. The OWL restriction is saying something that is supposed to be "true of the world" rather than true of any given data document. So seeing an instance of a Document an OWL processor will conclude that it must have an author (because every Document does) just not one we know about yet. So in fact if you now ask an OWL aware processor for the author of myDoc you might, for example, get back a bNode - an example of the inferential, as opposed to constraint checking, nature of OWL processing. This also fits in with the open world assumption - there may be another triple giving an author for myDoc "out there" somewhere.

Of course, the fact that general OWL processors behave this way doesn't prevent one from creating a specialist validator which treats a document as a complete closed description and flags any such missing properties - it is just that a generic OWL reasoner probably won't do this by default.
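A specialist validator of that kind can be very small. The sketch below, in Python with triples held as plain tuples, treats the data as a complete closed description and reports every instance of a class that lacks a required property. The `missing_required` helper and the extra `eg:otherDoc` data are assumptions for illustration, and the check deliberately ignores the open world assumption.

```python
def missing_required(triples, cls, prop):
    """Closed-world check: instances of cls with no value for prop in triples."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    with_value = {s for s, p, o in triples if p == prop}
    return sorted(instances - with_value)

data = [
    ("eg:myDoc", "rdf:type", "eg:Document"),
    ("eg:otherDoc", "rdf:type", "eg:Document"),
    ("eg:otherDoc", "dc:author", "eg:Dave"),
]
print(missing_required(data, "eg:Document", "dc:author"))  # ['eg:myDoc']
```

A generic OWL reasoner given the same data would report no problem and would instead conclude that eg:myDoc has some author it does not yet know about.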

2. Limiting the number of properties

A related example is expressing the constraint that "every document can have at most one copyright holder".

  eg:Document rdf:type owl:Class;
              rdfs:subClassOf [ a owl:Restriction;
               owl:onProperty     eg:copyrightHolder;
               owl:maxCardinality "1"^^xsd:integer].

  eg:myDoc rdf:type eg:Document ;
           eg:copyrightHolder eg:institute1 ;
           eg:copyrightHolder eg:institute2 .

Again, if you ask a general OWL processor to validate this set of statements you might expect it to complain that there are two values for eg:copyrightHolder. Not so. In this case, the problem is the lack of a unique name assumption. On the web two different URIs could refer to the same resource and there is no defined way to tell this. Unless there is an explicit declaration that eg:institute1 and eg:institute2 are owl:differentFrom each other then there is no violation.

Indeed, just like in the first example, what an OWL processor does is the reverse. Instead of noticing a violation it infers additional facts which must be true if the data is consistent. In this case it would infer:

       eg:institute1 owl:sameAs  eg:institute2 .

Again, a specialist OWL processor could be told to make an additional unique name assumption to handle such cases but that is not a good thing to do in general. In fact, using such cardinality constraints (e.g. in the guise of owl:InverseFunctionalProperty or owl:FunctionalProperty) to detect aliases is a powerful and much used feature of OWL.
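The two readings of the same cardinality restriction can be put side by side. In this Python sketch, with triples as plain tuples, one function produces the owl:sameAs conclusions an inferential processor would draw and the other flags a violation under an added unique name assumption. Both helpers are hypothetical names, and neither looks at owl:differentFrom statements, which a real processor would have to take into account.

```python
from itertools import combinations

def values_of(triples, subject, prop):
    """Return the distinct values of prop for subject, in sorted order."""
    return sorted({o for s, p, o in triples if s == subject and p == prop})

def infer_same_as(triples, subject, prop):
    """OWL-style reading of maxCardinality 1: all the values name one thing."""
    values = values_of(triples, subject, prop)
    return [(a, "owl:sameAs", b) for a, b in combinations(values, 2)]

def violates_max_one(triples, subject, prop):
    """Validator-style reading with a unique name assumption added."""
    return len(values_of(triples, subject, prop)) > 1

data = [
    ("eg:myDoc", "rdf:type", "eg:Document"),
    ("eg:myDoc", "eg:copyrightHolder", "eg:institute1"),
    ("eg:myDoc", "eg:copyrightHolder", "eg:institute2"),
]
print(infer_same_as(data, "eg:myDoc", "eg:copyrightHolder"))
print(violates_max_one(data, "eg:myDoc", "eg:copyrightHolder"))
```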

Life is a little easier if one is dealing with DatatypeProperties because you can tell when two literals are distinct (well even this is hard when you are looking at different xsd number classes but at least strings are easy!).

3. Type constraints

The third common schema requirement is to limit the types of values a given property can take. For example:

  eg:Document rdf:type owl:Class;
              owl:equivalentClass [ a owl:Restriction;
               owl:onProperty     eg:author ;
               owl:allValuesFrom  eg:Person ].

  eg:myDoc rdf:type eg:Document ;
           eg:author eg:Daffy .
  eg:Daffy rdf:type eg:Duck.

  eg:myDoc2 eg:author eg:Dave .
  eg:Dave rdf:type eg:Person .

Does the myDoc example cause a constraint violation? No. In RDF an instance can be a member of many classes. Unless we are explicitly told that the classes eg:Duck and eg:Person are disjoint, all that happens with the myDoc example is that we infer that eg:Daffy must be a Person as well. Again a specialist processor could be developed to flag a warning in cases where an object is inferred to have a type which is not a known supertype of its declared types; again this would be making additional assumptions not warranted in the general case but useful for input validation purposes.

Having got the hang of the idea that OWL is more about inference than constraint checking, what about myDoc2? Should the OWL processor infer that myDoc2 is a Document? After all we defined a Document this time using a complete, rather than partial, definition - so that anything for which all authors are Persons should be a document and the author of myDoc2 is a person. The answer, again, is "no". Just because all the authors we see happen to be people doesn't mean there aren't more authors for myDoc2 that we don't know about.
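The inference made for myDoc, and the one not made for myDoc2, can be sketched in a few lines of Python over triples held as tuples. The `infer_all_values_from` helper is a made-up name that applies just this one restriction; it is nowhere near a general OWL reasoner.

```python
def infer_all_values_from(triples, cls, prop, value_cls):
    """For each instance of cls, infer that every value of prop has type value_cls."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    inferred = set()
    for s, p, o in triples:
        if s in instances and p == prop:
            inferred.add((o, "rdf:type", value_cls))
    # Report only the triples that were not already stated.
    return inferred - set(triples)

data = [
    ("eg:myDoc", "rdf:type", "eg:Document"),
    ("eg:myDoc", "eg:author", "eg:Daffy"),
    ("eg:Daffy", "rdf:type", "eg:Duck"),
    ("eg:myDoc2", "eg:author", "eg:Dave"),
    ("eg:Dave", "rdf:type", "eg:Person"),
]
print(infer_all_values_from(data, "eg:Document", "eg:author", "eg:Person"))
```

The only new triple says that eg:Daffy is an eg:Person. Nothing is concluded about eg:myDoc2, because the function never reasons from "all known authors are Persons" back to membership of eg:Document.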

4. Value ranges

Another common schema requirement is to limit the range of a value. For example to say that an integer representing a day-of-the-month should be between 1 and 31.

Data ranges are not part of OWL at all.

You can express them within XML Schema Datatypes. You could declare a user defined XSD datatype which is an xsd:integer restricted to the range 1 to 31.

There is a problem that XML Schema doesn't define a standard way of determining the URI for a user defined datatype and the RDF datatyping mechanism requires all datatypes to have a URI. This will hopefully get "clarified" and in any case there is a de facto convention which is straightforward, used by DAML and supported by toolkits, so in the meantime we can be non-standard but get work done.

It is also slightly less useful than it seems since the RDF datatyping machinery requires that each literal value have an explicit datatype URI - you can't just give a lexical value and use range constraints to apply the type.

These caveats aside, the xsd user defined datatype machinery is useful and this is the one place where RDFS on its own, without OWL, can do some validation. An RDFS processor should detect if the lexical form of a typed literal does not match the declared datatype.
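The kind of check involved can be sketched as follows in Python. It first tests that a lexical form is a legal xsd:integer and then applies the 1 to 31 restriction of a user defined datatype. The `check_day_of_month` helper and the idea of a datatype called eg:dayOfMonth are assumptions for this example; a real processor would read the restriction from the XML Schema definition.

```python
import re

# Lexical space of xsd:integer: an optional sign followed by decimal digits.
INTEGER_LEXICAL = re.compile(r'[+-]?[0-9]+')

def check_day_of_month(lexical):
    """Return a list of problems for a literal typed as the hypothetical eg:dayOfMonth."""
    if not INTEGER_LEXICAL.fullmatch(lexical):
        return ["not a valid xsd:integer lexical form"]
    if not 1 <= int(lexical) <= 31:
        return ["outside the range 1 to 31"]
    return []

for lexical in ["14", "32", "14th"]:
    print(lexical, check_day_of_month(lexical))
```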

5. Complex constraints

The final forms of constraints that come up are ones which involve constraints between values. For example, that a pair of properties should form a unique value pair, or that the value of one datatype property must be less than another property of the same resource, or of a related resource.

No such cross-property constraints can be expressed at all in OWL.
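Constraints like these therefore end up in application code. As a minimal sketch, assuming triples held as Python tuples with numeric values and two hypothetical properties eg:firstPage and eg:lastPage, a "less than" check between two properties of the same resource might look like this:

```python
def less_than_violations(triples, low_prop, high_prop):
    """Return subjects having a low_prop value that is not below a high_prop value."""
    lows, highs = {}, {}
    for s, p, o in triples:
        if p == low_prop:
            lows.setdefault(s, []).append(o)
        elif p == high_prop:
            highs.setdefault(s, []).append(o)
    return sorted(
        s for s in lows
        if any(low >= high for low in lows[s] for high in highs.get(s, []))
    )

data = [
    ("eg:article1", "eg:firstPage", 10),
    ("eg:article1", "eg:lastPage", 25),
    ("eg:article2", "eg:firstPage", 40),
    ("eg:article2", "eg:lastPage", 31),
]
print(less_than_violations(data, "eg:firstPage", "eg:lastPage"))  # ['eg:article2']
```

Like the other closed checks in this entry, it only looks at the triples it is given and says nothing about values that may be stated elsewhere.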

Posted by dreynold2 at 02:11 PM | Comments (0)

March 04, 2004

Thesaurus FAQ Entry: How can I make my thesaurus a part of the Semantic Web?

To make a thesaurus a part of the semantic web, simply

  • encode the thesaurus as RDF using the SKOS schemas,
  • publish the RDF data.
The SKOS schemas are RDF schemas for encoding thesauri and similar types of knowledge organisation system (KOS).

SKOS-Core is the core schema, allowing representation of thesaurus concepts, terms, and organisation of those concepts into hierarchical and associative structures. It has been designed as an extensible framework of properties, and so can be adapted to cope with different types of thesaurus.
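As a small sketch of what the encoding step can look like, the Python below writes N-Triples for thesaurus concepts. It assumes the SKOS namespace http://www.w3.org/2004/02/skos/core# and the terms skos:Concept, skos:prefLabel and skos:broader; check these against the release of SKOS-Core you are using, since a pre-release may differ. The base URI and the `concept_triples` helper are made up for the example, and labels are not escaped.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/thesaurus/"  # hypothetical base URI

def concept_triples(concept_id, label, broader=None):
    """Return N-Triples lines describing one thesaurus concept."""
    subject = "<%s%s>" % (BASE, concept_id)
    lines = [
        "%s <%s> <%sConcept> ." % (subject, RDF_TYPE, SKOS),
        '%s <%sprefLabel> "%s" .' % (subject, SKOS, label),
    ]
    if broader is not None:
        # Link the concept to its parent in the hierarchy.
        lines.append("%s <%sbroader> <%s%s> ." % (subject, SKOS, BASE, broader))
    return lines

for line in concept_triples("mammals", "Mammals"):
    print(line)
for line in concept_triples("cats", "Cats", broader="mammals"):
    print(line)
```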

The version of SKOS-Core currently available is a pre-release, and a good introduction to using the schema can be found here. A formal release (version 1.0) is planned shortly, along with a guide to using it - watch this space!

SKOS-Mapping is an RDF schema for creating and encoding mappings between thesauri. If mappings between thesauri are available, independent but overlapping thesauri can be used interchangeably, helping to remove the boundaries between collections and communities. A good introduction to SKOS-Mapping with examples is here.

SKOS-Mapping is also currently available as a pre-release version. A formal release can be expected shortly after SKOS-Core 1.0.

There are also a number of reports on issues relating to the use of thesauri on the semantic web, including a review of previous work and a report on multilingual thesauri. The work is ongoing, and discussed on the public-esw-thes@w3.org mailing list (archives) - feel free to join in!

Categories Frequently Asked Questions (FAQs) | Thesaurus

Posted by ajmiles at 09:26 AM | Comments (0)

February 27, 2004

Proto FAQ Entry - Why not use an RDF graph with blanks for querying RDF?

A. You can, and some people have. It can be useful but is much less expressive than most full RDF query languages.

In RDF, blank nodes are treated as existential variables - they indicate the existence of a thing without saying anything about the name of that thing. So it is reasonable to express a query as a graph with bNodes used as if they were wildcards and to define a query operation as something like "find all instances of the query graph which are entailed by the data". Depending on the application, your operation might instead want to find the union of that set of matching subgraphs rather than return the separate matches.
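A minimal sketch of this idea in Python, with triples as tuples of strings and anything starting with `_:` treated as a bNode in the query, is shown below. The `match` function is a made-up name and does simple subgraph matching against the stated triples only; it does not compute entailment, and bNodes that appear in the data are treated as ordinary names.

```python
def is_blank(term):
    return term.startswith("_:")

def match(query, data, binding=None):
    """Yield every binding of the query's bNodes that maps all query triples into data."""
    binding = binding or {}
    if not query:
        yield dict(binding)
        return
    first, rest = query[0], query[1:]
    for triple in data:
        new = dict(binding)
        ok = True
        for q, d in zip(first, triple):
            if is_blank(q):
                # A bNode must be bound to the same term everywhere it occurs.
                if new.setdefault(q, d) != d:
                    ok = False
                    break
            elif q != d:
                ok = False
                break
        if ok:
            yield from match(rest, data, new)

data = [
    ("eg:myDoc", "dc:author", "eg:Dave"),
    ("eg:Dave", "rdf:type", "eg:Person"),
    ("eg:myDoc2", "dc:author", "eg:Daffy"),
    ("eg:Daffy", "rdf:type", "eg:Duck"),
]
query = [
    ("_:doc", "dc:author", "_:who"),
    ("_:who", "rdf:type", "eg:Person"),
]
results = list(match(query, data))
print(results)  # [{'_:doc': 'eg:myDoc', '_:who': 'eg:Dave'}]
```

Each result is one match. Substituting a result's bindings into the query and taking the union of the resulting graphs would give the merged form mentioned above.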

This can work but it is quite restrictive.

First, bNodes can only be used in place of nodes, not in place of properties. This is a big limitation since many queries require matching over properties. Second, you can't express constraints such as string pattern matches or range constraints on the literals to be matched. To get around this, attempts at this "query by example" approach often use metalevel annotations to allow such things to be expressed. For example, see our own experiments in this area, RDF-QBE. Once you start doing this you can use the annotations to identify the query nodes in the first place and not bother using bNodes at all. This is essentially what the simplest of the Edutella query languages, RDF-QEL-1, does.

Other limitations are the inability to express disjunctive queries this way (RDF is purely conjunctive) and the awkwardness of expressing constraints between variables.

Despite these limitations the symmetry of expressing queries, and indeed the resulting matches, directly in RDF rather than indirectly encoded in RDF is appealing and could be appropriate in some applications.

[N.B. This is an early version of a FAQ entry responding to one of the items on the FAQ ideas list. I'm sure others will be able to add more information on this topic and over time the proto-entry might turn into a real entry.]

Categories Frequently Asked Questions (FAQs)

Posted by dreynold2 at 04:30 PM | Comments (0)

January 27, 2004

Thesaurus Activity FAQ

Q: What can thesauri do for the web?

A: Thesauri can enrich the web in several ways.

Thesauri can be used to organise information in a sensible way, which in turn helps us to find what we are looking for on the web. Richer than a simple taxonomy, but simpler than a full blown ontology, thesauri provide a convenient yet powerful way to achieve knowledge organisation. Furthermore, because thesauri have been used for decades by library scientists for the same purpose, there exist a number of extremely well structured, well engineered thesauri in the public domain. Providing the framework for bringing these systems on to the semantic web is a major goal of the SWAD-Europe Thesaurus Activity.

A thesaurus also includes information about terminology, and how different terms may be used to represent different concepts. A thesaurus with rich terminological data can be used to support tasks such as automated classification of documents.

These are two of the ways that thesauri can help significantly reduce the energy barrier that stands before the explosion of the semantic web. By bringing existing knowledge organisation systems into the web, we reduce the effort required in the engineering of ontologies from scratch. And by supporting tasks such as automated document classification, the effort required in generating the metadata that is fundamental to the semantic web is greatly reduced.

Finally, multilingual thesauri provide new opportunities for cross-language interaction via the web.

Categories Frequently Asked Questions (FAQs) | Thesaurus
Posted by ajmiles at 03:10 PM | Comments (0)

January 21, 2004

FAQ entry - rdfs:domain and range

Q. Why do rdfs:domain and rdfs:range seem to work back-to-front when it comes to thinking about the class hierarchy?

A. Because RDFS is a logic-based system. The way rdfs range and domain declarations work is alien to anyone who thinks of RDFS and OWL as being a bit like a type system for a programming language, especially an object oriented language.

To expand on the problem. Suppose we have three classes:
eg:Animal eg:Human eg:Man

And suppose they are linked into the simple class hierarchy:
eg:Man rdfs:subClassOf eg:Human .
eg:Human rdfs:subClassOf eg:Animal .

Now suppose we have property eg:personalName with:
eg:personalName rdfs:domain eg:Human .

The question to ask is this: "can we deduce:
eg:personalName rdfs:domain eg:Man ?"

The answer is "no"; the correct deduction of this kind is:
eg:personalName rdfs:domain eg:Animal .

This is completely obvious to anyone who thinks about RDFS as a logic system, however it can be surprising if you are thinking in terms of objects.

A common line of thought is this: "surely [P rdfs:domain C] means roughly that P 'can be applied to' objects of type C, just like a type constraint in a programming language. Now all instances of eg:Man are also eg:Human so we can always apply eg:personalName to eg:Man things, doesn't that mean eg:Man is in the domain of eg:personalName?"

There are two flaws in this line of thought. First, rdfs:domain isn't really a constraint and doesn't mean 'can be applied to'. It means more or less the opposite: it enables an inference rather than imposing a constraint. [P rdfs:domain C] means that if you see a triple [X P foo] then you are licensed to deduce that X must be of type C. So if we made the illegal deduction [eg:personalName rdfs:domain eg:Man] then everything we applied eg:personalName to would become an eg:Man and we could no longer have things of type eg:Human which aren't of type eg:Man. Whereas the correct deduction [eg:personalName rdfs:domain eg:Animal] is safe because every eg:Human is an eg:Animal so the domain deductions don't tell us anything that wasn't already true, so to speak!

The second flaw is in the phrasing "is in the domain of". It is true that eg:Man is, in some sense, "in the domain of" eg:personalName but the correct translation of this loose phrase is that "eg:Man is a subclass of the domain of eg:personalName" which is quite different from saying "eg:Man *is* the domain of eg:personalName."
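The inferential reading of rdfs:domain can be tried out with a small forward-chaining sketch in Python. It applies only two rules, the domain rule and the subclass rule for rdf:type, to triples held as tuples. The `rdfs_types` helper and the eg:alice data are assumptions for illustration; this is not a complete RDFS implementation.

```python
def rdfs_types(triples):
    """Apply the rdfs:domain and rdfs:subClassOf rules until no new rdf:type triples appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                # Domain rule: [P rdfs:domain C] and [X P foo] give [X rdf:type C].
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, "rdf:type", o2))
                # Subclass rule: [X rdf:type C] and [C rdfs:subClassOf D] give [X rdf:type D].
                if p2 == "rdfs:subClassOf" and p == "rdf:type" and s2 == o:
                    new.add((s, "rdf:type", o2))
        if not new <= facts:
            facts |= new
            changed = True
    return {t for t in facts if t[1] == "rdf:type"}

data = [
    ("eg:Man", "rdfs:subClassOf", "eg:Human"),
    ("eg:Human", "rdfs:subClassOf", "eg:Animal"),
    ("eg:personalName", "rdfs:domain", "eg:Human"),
    ("eg:alice", "eg:personalName", "Alice"),
]
for triple in sorted(rdfs_types(data)):
    print(triple)
```

Using eg:personalName makes eg:alice an eg:Human and, through the class hierarchy, an eg:Animal, which is exactly what a domain of eg:Animal would also conclude. Nothing makes eg:alice an eg:Man.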

Categories Frequently Asked Questions (FAQs)
Posted by dreynold2 at 05:28 PM | Comments (2)