RDFizing and Interlinking the EuroStat Data Set Effort - riese

A LinkingOpenData project



Note: riese was launched on 2008-01-31, cf. http://riese.joanneum.at/

This page is the main resource for the "RDFizing and interlinking the EuroStat data set" (riese) project, an effort within the SWEO LinkingOpenData project.

Introduction

While the US census data is already more or less available as RDF, the European equivalent has not yet been fully RDFized and interlinked. A first attempt was made at the FU Berlin.

We aim at RDFizing the whole EuroStat data set. To give an idea of the scale: a conservative estimate is some 4.000.000.000 (4 billion) RDF triples. This estimate is based on the EuroStat TOC, assuming some 10 triples per item value.
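
The estimate above can be turned around to see how many item values it implies. The following is only a back-of-the-envelope sketch; the function name is made up for illustration, and the two constants are the figures quoted in the text.

```python
# Back-of-the-envelope check of the estimate quoted above.
ESTIMATED_TRIPLES = 4_000_000_000   # figure from the text
TRIPLES_PER_ITEM_VALUE = 10         # assumption stated in the text


def implied_item_values(total_triples: int, triples_per_item: int) -> int:
    """Number of item values that a given triple estimate corresponds to."""
    return total_triples // triples_per_item


if __name__ == "__main__":
    # 4 billion triples at 10 triples each is 400 million item values.
    print(implied_item_values(ESTIMATED_TRIPLES, TRIPLES_PER_ITEM_VALUE))
```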

Design

Schemas

As a first step, the existing EuroStat data schema is recreated in RDF (using RDF-S), along with a mapping to an RDF graph.

YvesRaimond's comments:

  • Using rdfs:label instead of riese:hasLabel and riese:hasItemValue?
  • Not sure about the time representation here: does TimePoint identify a single time point (yesterday, 7pm) or a set of time points (every day at 7pm)? In the first case we could just use dc:date for the sake of simplicity. For the second case, we might use a little bit of OWL-time?
  • rdf:value instead of riese:hasDicValue and hasItemValue?
  • wrt. the geographic parts, we could re-use the geonames ontology?

Michael's answers to YvesRaimond's comments:

  • I actually thought about using rdfs:label everywhere, though I tend to introduce new props if they should state something semantically different. So I guess it might be a good idea to make riese:hasLabel an rdfs:subPropertyOf rdfs:label (or replace it), but with riese:hasItemValue I think it is cleaner to have a separate prop (it is the item's value after all :)
    • Ok! I guess it doesn't matter much in this case. However, I'd rather use nouns instead of verbs, so I would put riese:itemValue instead of riese:hasItemValue, and riese:label instead of riese:hasLabel.
  • riese:TimePoint specifies the point in time (whose granularity is defined by the corresponding riese:TimeDataFormat) at which an item's value is valid (I'll make up an example, soon ...); dc:date is not flexible enough wrt formatting; OWL time might be overkill, but I'll have a look into what we can utilise.
    • Hmm, I guess I am not clear about that... So your time points are actually intervals (if it's yearly, then one time point covers an interval of one year). In this case, I would just use a "startsAt" property and a "duration" property (having respectively xsd:dateTime and xsd:duration as a range), I think this should be expressive enough. I would also rename riese:TimePoint as riese:Interval. Actually, the more I think about it, the more I think the event ontology at [1] may be enough to cleanly model this "time dependent" part, as what we are trying to model is just a classification of a space/time region... This would look like:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix riese: <http://riese.joanneum.at/core#>.
@prefix event: <http://purl.org/NET/c4dm/event.owl#>.
@prefix time: <http://purl.org/NET/c4dm/timeline.owl#>.
@prefix dic: <http://riese.joanneum.at/core#>.
@prefix : <>.

# Ontology
riese:Item rdfs:subClassOf event:Event.
riese:value rdfs:subPropertyOf event:literal_factor.
riese:unit rdfs:subPropertyOf event:factor;rdfs:subPropertyOf riese:dic.
riese:s_adj rdfs:subPropertyOf event:factor. # and so on for each DIC...

# Some instance data 
#  http://europa.eu.int/estatref/info/notes/en/read_me.htm
:data a riese:Item;
   riese:value "11148";
   riese:unit "mio-eur"; # are these properties operating over a discrete space? in this case we should consider creating individuals for these, and we could just use riese:dic
   riese:s_adj "nsa";
   riese:partner "ext_eurozone";
   riese:flow "net";
   riese:indic "bp-100";
   event:place <http://dbpedia.org/resource/Europe>;
   event:time  :int2004m05; # or just [time:at "2004-05"^^xsd:gYearMonth]
   .

:int2004m05 a time:Interval;
   time:at "2004-05"^^xsd:gYearMonth;
   .

What do you think? The good thing is that it is extensible: if we happen to get more information on how these values are captured, we can still attach it to the event...

  • rdf:value instead of riese:hasDicValue: again I'd prefer rdfs:subPropertyOf rdf:value, but I am open to discussion. Not sure about riese:hasItemValue ... you proposed rdfs:label earlier ... hm, I guess we need a #swig session ;)
  • reuse the geonames ontology: absolutely; AND interlink it ...
  • change log for v0.1:
    • removed riese:hasLabel and added rdfs:label, instead
    • removed riese:hasItemValue and added rdf:value, instead
    • added rdf:value to riese:TimePoint
    • changed riese core NS to http://riese.joanneum.at/core#, which will be the actual server hosting the riese stuff
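
Coming back to the "startsAt"/"duration" suggestion in the time discussion above, a rough sketch of how a period could be expanded into such an interval might look as follows. It rests on several assumptions: period codes of the form "2004" (yearly) and "2004m05" (monthly) are inferred from the :int2004m05 identifier in the example, the properties are placed in the riese namespace as riese:startsAt and riese:duration purely for illustration, and the function names are hypothetical.

```python
import re
from typing import Tuple

# Assumed period-code shapes: "2004" (yearly) and "2004m05" (monthly).
_PERIOD = re.compile(r"^(\d{4})(?:m(\d{2}))?$")


def period_to_interval(code: str) -> Tuple[str, str]:
    """Return (xsd:dateTime start, xsd:duration length) for a period code."""
    match = _PERIOD.match(code)
    if match is None:
        raise ValueError(f"unrecognised period code: {code!r}")
    year, month = match.groups()
    if month is None:
        # A yearly value covers the whole calendar year.
        return f"{year}-01-01T00:00:00", "P1Y"
    if not 1 <= int(month) <= 12:
        raise ValueError(f"month out of range in period code: {code!r}")
    # A monthly value covers one calendar month.
    return f"{year}-{month}-01T00:00:00", "P1M"


def interval_turtle(code: str) -> str:
    """Render the interval as Turtle (property names are illustrative only)."""
    start, duration = period_to_interval(code)
    return (
        f":int{code} a riese:Interval;\n"
        f'   riese:startsAt "{start}"^^xsd:dateTime;\n'
        f'   riese:duration "{duration}"^^xsd:duration.\n'
    )


if __name__ == "__main__":
    print(interval_turtle("2004m05"))
```

Other granularities (quarters, days) would need further patterns; they are left out because the page gives no examples of their codes.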

Observations

  • Looking at the TOC of the EuroStat data set, it seems to be a good idea to use the so-called "open datasets" marked with Full download (and a trailing _t in the code) rather than the individual tables. For example, innore_t.tsv might be preferred over ir010.tsv - ir140.tsv. However, some data are only available as individual datasets/tables. There appears to be no difference between datasets and tables. In order to get all data, both datasets and tables have to be used.
  • As the geo.dic offers more structured data, the proposed riese geo schema seems reasonable. Note: The cities and regions are available in the native language, i.e. it is "Wien" not "Vienna"; we might consider using an xml:lang tag? The files partner.dic, fats_own.dic, trs_geo.dic, load.dic and unload.dic contain exactly the same information as geo.dic. Other files offer deeper semantics as well.
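
The two observations above could be sketched roughly as follows: telling open datasets from individual tables by the trailing _t in the code, and emitting native-language labels with a language tag. Both function names are hypothetical, the subject :wien and the tag "de" are made up for the example, and nothing here reflects the actual riese tooling.

```python
from typing import Iterable, List, Tuple


def split_downloads(file_names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate open datasets (code ends in "_t") from individual tables."""
    datasets: List[str] = []
    tables: List[str] = []
    for name in file_names:
        code = name.rsplit(".", 1)[0]  # drop the ".tsv" extension, if any
        if code.endswith("_t"):
            datasets.append(name)
        else:
            tables.append(name)
    return datasets, tables


def label_triple(subject: str, label: str, lang: str) -> str:
    """Render an rdfs:label triple in Turtle with a language tag."""
    # Escape backslashes and double quotes so the literal stays valid Turtle.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f'{subject} rdfs:label "{escaped}"@{lang} .'


if __name__ == "__main__":
    datasets, tables = split_downloads(["innore_t.tsv", "ir010.tsv", "ir140.tsv"])
    print(datasets)  # the full-download dataset
    print(tables)    # tables that still need to be fetched individually
    print(label_triple(":wien", "Wien", "de"))
```

Deciding which individual tables are already covered by an open dataset would still require the TOC itself; the suffix check only separates the two kinds of file.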

Resources

Publications and Presentations

Related

People Interested in the Area

Please add yourself here in case you want to contribute (in terms of schema, mapping, development, testing, UI, etc.).

Note: The main discussion forum is the #swig channel at freenode; see also the log at [2].