SWEO Community Project: Linking Open Data on the Semantic Web
Equivalence Mining and Matching Frameworks
This page collects software tools and papers about techniques that can be used to auto-generate links between data items in different data sources.
The page is part of the community project wiki:SweoIG/TaskForces/CommunityProjects/LinkingOpenData
An example of an equivalence link is <http://dbpedia.org/resource/Berlin> owl:sameAs <http://sws.geonames.org/2950159>, which claims that a data item in the dbpedia dataset is the same as a data item in the Geonames dataset.
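As a rough illustration of what auto-generating such links can look like, here is a minimal Python sketch that links items from two sources when their labels are identical after normalization. The label values, the function names and the exact-match heuristic are assumptions made for this example. They are not taken from any of the tools listed on this page, and real datasets usually need fuzzier matching and more evidence than a label.

```python
import re

# Full URI of the owl:sameAs property.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize(label):
    """Lower-case a label and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def same_as_links(source_a, source_b):
    """Yield N-Triples owl:sameAs statements for items with matching labels.

    Both arguments map an item URI to a label (illustrative input shape).
    """
    # Index the second source by normalized label.
    index = {}
    for uri, label in source_b.items():
        index.setdefault(normalize(label), []).append(uri)

    for uri, label in source_a.items():
        candidates = index.get(normalize(label), [])
        # Only link when the label is unambiguous in the second source.
        if len(candidates) == 1:
            yield f"<{uri}> <{OWL_SAME_AS}> <{candidates[0]}> ."


# Illustrative labels; the URIs are the ones from the example above.
dbpedia = {"http://dbpedia.org/resource/Berlin": "Berlin"}
geonames = {"http://sws.geonames.org/2950159": "berlin"}

links = list(same_as_links(dbpedia, geonames))
for line in links:
    print(line)
```

Skipping ambiguous labels trades recall for precision, which seems a reasonable default because a wrong owl:sameAs link merges two distinct things.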
A simple alternative that avoids the need for equivalence mining is to use commonly accepted identifiers within URIs. For example, the RDF book mashup uses ISBN numbers in its URIs. This allows other data sources about books to set links to the data items of the book mashup using a simple URI pattern that includes the ISBN number.
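The URI-pattern idea can be sketched in a few lines of Python. The base URI below is a hypothetical placeholder, not the actual pattern used by the RDF book mashup, and the helper names are made up for this example. The sketch only strips separators and checks the length of the ISBN; it does not validate the check digit.

```python
# Hypothetical base URI; substitute the real pattern of the target dataset.
BASE_URI = "http://example.org/books/"


def clean_isbn(raw):
    """Remove hyphens and spaces; return None unless 10 or 13 characters remain."""
    isbn = raw.replace("-", "").replace(" ", "").upper()
    return isbn if len(isbn) in (10, 13) else None


def book_uri(raw_isbn):
    """Build a link target for a book from its ISBN using the URI pattern."""
    isbn = clean_isbn(raw_isbn)
    if isbn is None:
        raise ValueError(f"not a plausible ISBN: {raw_isbn!r}")
    return BASE_URI + isbn


uri = book_uri("0-596-00027-8")
print(uri)
```

Any data source that knows the ISBN of a book can compute the same URI independently, so no matching step is needed.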
Software Tools
TopBraid Composer (ontology editor made by TopQuadrant) has a wizard for linking ontology instances to corresponding DBpedia concepts. See for details.
SemMF: a flexible framework for calculating semantic similarity between objects that are represented as arbitrary RDF graphs. The framework allows taxonomic and non-taxonomic concept matching techniques to be applied to selected object properties.
Yves' Equivalence Miner, together with an experience report about the problems he ran into while interlinking Jamendo and Musicbrainz.
MOAT (Meaning Of A Tag): a framework for manually interlinking tags with Semantic Web URIs (such as URIs from dbpedia, geonames … or any knowledge base)
People Interested in the Area
- Stefano Mazzocchi (work plan)
- Felix Van de Maele
- Chris Bizer (I want to set links from the dbpedia dataset to other datasets. Already done: geonames, planned: Musicbrainz, US Census data. If you have other datasets that are suitable for linking to dbpedia, please let me know.)
- Tom Heath (I'm primarily interested right now in very lightweight, low-cost heuristics/hacks to link up things, places, and reviews; the RDF Bookmashup ISBN approach is the kind of place I'm looking to start)
- Yves Raimond
- Hugh Glaser (Doing a lot of this stuff between big people, projects and publications sources.)
Papers and Web Resources on the Topic
Chris Bizer, Tom Heath: Auto-generated owl:SameAs links between the RDF Book Mashup and the DBLP database
Stefano Mazzocchi: Rewiring Scenarios
Yves Raimond: Linking open data: publishing and linking the Jamendo dataset
Yves Raimond: Linking open data: interlinking the BBC John Peel sessions and the DBPedia datasets
Alani, H., Dasmahapatra, S., Gibbins, N., Glaser, H., Harris, S., Kalfoglou, Y., O'Hara, K. and Shadbolt, N. Managing Reference: Ensuring Referential Integrity of Ontologies for the Semantic Web (2002).
This stuff has been done over and over in the database community, often called duplicate recognition or record linkage. So if somebody knows good overview papers about the area please add them to this page, so that people don't have to reinvent the wheel.
- Koudas: Approximate Joins. VLDB, 2005
- Fellegi: A theory of record linkage. Journal of the American Statistical Association, 1969
- Hernandez: Real-world Data is Dirty: Data Cleansing and The Merge / Purge Problem. Data Mining and Knowledge Discovery, 1998
There was a workshop on Ontology Matching at ISWC 2006. The approaches proposed there might also be useful for equivalence mining at the data item/instance level.