SWEO Community Project: Linking Open Data on the Semantic Web
Equivalence Mining and Matching Frameworks
This page collects software tools and papers about techniques that can be used to automatically generate links between data items in different data sources.
The page is part of the community project wiki:SweoIG/TaskForces/CommunityProjects/LinkingOpenData
An example of an equivalence link is <http://dbpedia.org/resource/Berlin> owl:sameAs <http://sws.geonames.org/2950159>, which claims that a data item in the DBpedia dataset is the same as a data item in the Geonames dataset.
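As a minimal sketch, the equivalence link above could be written out as one N-Triples statement. The helper name same_as_triple is hypothetical and not part of any tool listed on this page. The sketch assumes both arguments are already valid absolute URIs and does no escaping.

```python
# Full URI of the owl:sameAs property (OWL namespace + local name).
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(subject_uri: str, object_uri: str) -> str:
    """Return one N-Triples line stating that subject owl:sameAs object.

    Hypothetical helper for illustration; assumes both arguments are
    valid absolute URIs and performs no escaping or validation.
    """
    return f"<{subject_uri}> <{OWL_SAME_AS}> <{object_uri}> ."


if __name__ == "__main__":
    # The DBpedia / Geonames example from the text above.
    print(same_as_triple(
        "http://dbpedia.org/resource/Berlin",
        "http://sws.geonames.org/2950159",
    ))
```

A file of such lines can be loaded by any RDF store that reads N-Triples.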
A simple alternative that avoids the need for equivalence mining is to use commonly accepted identifiers within URIs. For example, the RDF Book Mashup uses ISBN numbers in its URIs. This allows other data sources about books to link to the Book Mashup's data items using a simple URI pattern that includes the ISBN number.
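The identifier-based approach can be sketched roughly as follows. The base URI http://example.org/books/ and the function names are assumptions made for illustration; they are not the actual URI pattern of the RDF Book Mashup. The sketch only normalizes the ISBN's formatting and does not verify its check digit.

```python
import re

# Hypothetical base URI, used only for illustration; the real URI
# pattern of a given data source has to be looked up.
BASE_URI = "http://example.org/books/"


def normalize_isbn(raw: str) -> str:
    """Strip hyphens and whitespace and uppercase a trailing 'x'.

    Accepts ISBN-10 (9 digits plus a digit or X) and ISBN-13 (13 digits).
    The check digit is NOT verified in this sketch.
    """
    isbn = re.sub(r"[\s-]", "", raw).upper()
    if not re.fullmatch(r"\d{9}[\dX]|\d{13}", isbn):
        raise ValueError(f"not an ISBN-10 or ISBN-13: {raw!r}")
    return isbn


def book_uri(raw_isbn: str, base: str = BASE_URI) -> str:
    """Build the URI of a book by appending its normalized ISBN to base."""
    return base + normalize_isbn(raw_isbn)


if __name__ == "__main__":
    print(book_uri("0-596-00027-8"))
```

Because both sides derive the URI from the same identifier, no similarity computation is needed. The link is only as reliable as the ISBNs in the source data.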
Software Tools
Silk - A Link Discovery Framework for the Web of Data: The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web.
TopBraid Composer (ontology editor made by TopQuadrant) has a wizard for linking ontology instances to corresponding DBpedia concepts. See for details.
SemMF: SemMF is a flexible framework for calculating semantic similarity between objects that are represented as arbitrary RDF graphs. The framework allows taxonomic and non-taxonomic concept matching techniques to be applied to selected object properties.
Yves Raimond's Equivalence Miner, together with an experience report about the problems he ran into while interlinking Jamendo and Musicbrainz.
MOAT (Meaning Of A Tag): A framework for manually interlinking tags with Semantic Web URIs (such as URIs from DBpedia, Geonames … or any other knowledge base).
People Interested in the Area
- Stefano Mazzocchi (work plan)
- Felix Van de Maele
- Chris Bizer (I want to set links from the DBpedia dataset to other datasets. Already done: Geonames; planned: Musicbrainz, US Census data. If you have other datasets that would be a good fit for linking to DBpedia, please let me know.)
- Tom Heath (I'm primarily interested right now in very lightweight, low-cost heuristics/hacks to link up things, places, and reviews; the RDF Bookmashup ISBN approach is the kind of place I'm looking to start)
- Yves Raimond
- Hugh Glaser (Doing a lot of this between large sources of data about people, projects and publications.)
- Oktie Hassanzadeh and Mariano Consens (Currently developing a tool for finding links between different data sources using state-of-the-art similarity join techniques)
Papers and Web Resources on the Topic
Yves Raimond, Christopher Sutton and Mark Sandler: Automatic Interlinking of Music Datasets on the Semantic Web. LDOW 2008 Paper.
Afraz Jaffri, Hugh Glaser and Ian Millard: URI Disambiguation in the Context of Linked Data. LDOW 2008 Paper.
Andriy Nikolov, Victoria Uren, Enrico Motta and Anne de Roeck: Handling instance coreferencing in the KnoFuss architecture, 2008.
A. Nikolov, V. Uren, E. Motta, A. de Roeck: KnoFuss: A comprehensive architecture for knowledge fusion. K-CAP 2007, Whistler, Canada, 2007.
Christian Becker, Chris Bizer, Georgi Kobilarov: BBC interlinks with DBpedia, 2008
Chris Bizer, Tom Heath: Auto-generated owl:sameAs links between the RDF Book Mashup and the DBLP database
Stefano Mazzocchi: Rewiring Scenarios
Yves Raimond: Linking open data: publishing and linking the Jamendo dataset
Yves Raimond: Linking open data: interlinking the BBC John Peel sessions and the DBPedia datasets
Alani, H., Dasmahapatra, S., Gibbins, N., Glaser, H., Harris, S., Kalfoglou, Y., O'Hara, K. and Shadbolt, N. Managing Reference: Ensuring Referential Integrity of Ontologies for the Semantic Web (2002).
This kind of matching has been studied over and over in the database community, where it is often called duplicate detection or record linkage. If you know good overview papers about the area, please add them to this page so that people don't have to reinvent the wheel.
- Elmagarmid et al.: Duplicate Record Detection: A Survey. TKDE, 2007
- Tutorial on Approximate Joins. VLDB, 2005
- Fellegi: A theory of record linkage. Journal of the American Statistical Association, 1969
- Hernandez: Real-world Data is Dirty: Data Cleansing and The Merge / Purge Problem. Data Mining and Knowledge Discovery, 1998
There was a workshop on Ontology Matching at ISWC 2006. The approaches proposed there might also be useful for equivalence mining at the data item/instance level.