RdfSmushing

From W3C Wiki

Smushing or aggregating RDF

This term is often used to name the process of aggregating resources based on inverse functional properties. If two resources have the same value for an inverse functional property, they denote the same thing (owl:sameAs) and their other properties can be merged, hence "smushed".

Smushing Implementations

  • LeoSauermann recently found that ELMO, an add-on for Sesame, includes a smusher. Here is the JavaDoc for the package: elmo smusher javadoc. The project itself is part of Sesame and can be found on the Sesame website.
  • According to this rdfweb post by MortenFrederiksen (dated Tue Mar 23 23:55:44 GMT 2004), smushing a dataset with 6733112 statements takes 16h using Redland. The smushing there is based on IFP search and aggregation. The implementation is in rdf-smush.c.
  • The Graph Versioning System GVS [1] does smushing based on functional and inverse functional properties.

Smushing Algorithm

The following description is taken from Leo Sauermann's blog:

I am putting together more about smushing, which will be a key factor in the global semantic web: to connect annotations that were made by different people.

A typical smushing algorithm would be:

  • Take a large datastore DS that contains a set of triples Tset = {Ta, Tb, Tc, ...}.
  • Iterate through the known InverseFunctionalProperties IFPset = {Ia, Ib, Ic, ...}.
  • For each InverseFunctionalProperty Iy that occurs as a predicate in Tset, check for smushing:
  • Find all triples TxIy whose predicate is Iy and that share the same object value.
  • Find one triple Txc in TxIy whose subject is a grounding resource / canonical resource (see below).
  • Take the subject Sx of Txc and aggregate the triples of all other subjects in TxIy onto Sx, that is, change the subject of those triples to Sx.
  • Add owl:sameAs triples to connect all Subjects(TxIy) to Sx.
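
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any implementation listed on this page: triples are plain (subject, predicate, object) tuples of strings, the names smush and choose_canonical are hypothetical, and the canonical subject defaults to the lexicographically smallest one. A small union-find is used so that subjects linked through several IFP values end up in one group; that detail is an addition and is not spelled out in the description above.

```python
from collections import defaultdict

OWL_SAME_AS = "owl:sameAs"


def smush(triples, ifps, choose_canonical=min):
    """Return a smushed copy of `triples`, a set of (s, p, o) string tuples.

    `ifps` is the set of predicates treated as inverse functional.
    `choose_canonical` picks one subject out of a set of equivalent subjects
    (hypothetical hook; the default `min` is an arbitrary but stable choice).
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Subjects that share the same value for the same IFP are merged.
    first_subject = {}
    for s, p, o in triples:
        if p in ifps:
            key = (p, o)
            if key in first_subject:
                parent[find(s)] = find(first_subject[key])
            else:
                first_subject[key] = s
                find(s)

    # Collect the equivalence classes and choose one canonical subject each.
    classes = defaultdict(set)
    for s in list(parent):
        classes[find(s)].add(s)
    canonical = {}
    for members in classes.values():
        target = choose_canonical(members)
        for s in members:
            canonical[s] = target

    # Rewrite subjects (objects are left alone, as in the steps above)
    # and connect every replaced subject to its canonical one.
    result = {(canonical.get(s, s), p, o) for s, p, o in triples}
    for s, target in canonical.items():
        if s != target:
            result.add((s, OWL_SAME_AS, target))
    return result
```

For example, calling smush on the triples A someIFP X, B someIFP X, B someProperty Z with ifps = {"someIFP"} yields A someProperty Z together with B owl:sameAs A.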

The problem arises when a set of triples TxIy has several subjects that should be the same, as defined by the IFP: which subject is the "canonical" one that should receive the triples?

There are different approaches to find the canonical resource:

  • pick one at random
  • prefer a resource that is annotated in a special ontology (e.g. prefer SKOS concepts over foaf:Persons)
  • prefer the more public resource (googlefight; public URLs win over private URIs)
  • prefer the best-annotated resource (the resource with the most triples; beware, this makes single resources self-amplifying)
  • prefer the resource with the shortest / the longest URI
  • prefer named resources over anonymous resources (this is very important: you must not smush to anonymous nodes)
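
Several of these preferences can be combined into one ranking. The sketch below is one possible combination, not a recommended policy: it assumes blank nodes are written with the "_:" prefix (as in N-Triples), that triple_counts is a hypothetical dict giving the number of triples per subject, and it prefers named resources first, then the best-annotated resource, then the shortest URI.

```python
def is_blank(node):
    # Assumption: anonymous resources are written as "_:label".
    return node.startswith("_:")


def choose_canonical(subjects, triple_counts=None):
    """Pick the canonical subject from a collection of equivalent subjects.

    Ranking (lower sorts first):
      1. named resources before blank nodes
      2. more triples before fewer
      3. shorter URI before longer
      4. lexicographic order as a stable tie-breaker
    """
    triple_counts = triple_counts or {}
    return min(
        subjects,
        key=lambda s: (is_blank(s), -triple_counts.get(s, 0), len(s), s),
    )
```

If every candidate is a blank node, the function still returns one of them, so a group of anonymous subjects smushes down to a single anonymous subject.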

Another question is what to do with the result of the smushing. There are different approaches:

  1. store the smushing in an extra graph
  2. delete the old triples, add the smushing
  3. add the smushing in addition to the old triples (tricky)

Each has obvious advantages and disadvantages. For gnowsis I would prefer (1), smushing into an extra graph, which is similar to (3) but separates the data.
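
A minimal sketch of approach (1), assuming triples are (subject, predicate, object) tuples and that a mapping from each subject to its canonical subject has already been computed. The function name build_smush_graph and the canonical dict are hypothetical. The original graph is never modified; only triples that are new go into the extra graph.

```python
OWL_SAME_AS = "owl:sameAs"


def build_smush_graph(original, canonical):
    """Return the extra graph for `original` without modifying it.

    `original` is a set of (s, p, o) tuples.
    `canonical` maps a subject to its canonical subject; subjects that
    are missing from the mapping are treated as their own canonical form.
    """
    extra = set()
    for s, p, o in original:
        target = canonical.get(s, s)
        if target != s:
            extra.add((target, p, o))            # triple moved onto the canonical subject
            extra.add((s, OWL_SAME_AS, target))  # link back to the old subject
    # Keep only triples that the original graph does not already contain.
    return extra - original
```

An application would then query the union of the original graph and the extra graph, and could discard the extra graph at any time to get back to the unsmushed data.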

In gnowsis we have the problem of incremental smushing: we crawl thousands of emails per day and would like to smush the persons in the addresses, but only for the new messages.

I can see why you'd want to replace bnodes with a URI if possible, and if all the subjects are bnodes then they'll smush down to a single bnode. But beyond that the need for a 'canonical' subject sounds like it's app-specific.

Ok if:

  • A someIFP X .
  • B someIFP X .
  • C someIFP X .
  • A someProperty Y .
  • B someProperty Z .

where A,B,C,X,Y,Z are named resources, following the approach suggested, with A as the preferred resource, you'll infer -

  • A someProperty Z .

which is probably useful if A is a person and you want to pull out a vCard representation from all their attributes (but then if you're using FOAF they won't have a URI anyhow...). But what is there to suggest you won't have a use for:

  • C someProperty Z .

I'm not sure, but without good justification it seems premature to ignore that last statement.

-- DannyAyers

Once you're working on a smushed model, you need to canonicalize your subjects. I.e. you should be looking for statements of the form

  • C owl:sameAs foo .
  • foo someProperty Z .

-- AlexStewart
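
The lookup described in the comment above can be sketched as follows, assuming triples are (subject, predicate, object) tuples and that each replaced subject has at most one owl:sameAs link to its canonical subject. The function names canonical_of and values are hypothetical.

```python
OWL_SAME_AS = "owl:sameAs"


def canonical_of(node, triples):
    """Follow owl:sameAs links from `node` to its canonical subject."""
    same_as = {s: o for s, p, o in triples if p == OWL_SAME_AS}
    seen = set()
    # Stop when there is no further link or when a cycle is detected.
    while node in same_as and node not in seen:
        seen.add(node)
        node = same_as[node]
    return node


def values(node, prop, triples):
    """Return all values of `prop` for `node`, read from its canonical subject."""
    subject = canonical_of(node, triples)
    return {o for s, p, o in triples if s == subject and p == prop}
```

With the two statements C owl:sameAs foo and foo someProperty Z, asking for the someProperty values of C returns Z, so nothing about C is lost even though the triple is stored on foo.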
