HCLSIG/LODD/Interlinking

Interlinking Methodology in the LODD project

Many identifiers commonly used in the life sciences can be used to make links between data sets explicit. Links generated from shared identifiers include the connections from LinkedCT to Bio2RDF's PubMed and from DrugBank to DBpedia. Bio2RDF already provides the connections between bioinformatics and cheminformatics data sources, which allows us to interlink our drug-related data sets with that work.
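As a minimal sketch of identifier-based linking, the snippet below joins two record sets on a shared identifier and emits one link triple per match. The function name, the `ex:` URIs, the identifier values, and the choice of `rdfs:seeAlso` as the link predicate are all hypothetical and chosen only for illustration; the source does not say which predicates or identifiers the LODD data sets actually use.

```python
from collections import defaultdict

def link_by_shared_id(source, target, predicate="owl:sameAs"):
    """Join two {uri: identifier} maps on equal identifiers.

    Returns a list of (source_uri, predicate, target_uri) triples.
    """
    # Index target URIs by identifier so each source record is one lookup.
    by_id = defaultdict(list)
    for uri, ident in target.items():
        by_id[ident].append(uri)

    links = []
    for uri, ident in source.items():
        for match in by_id.get(ident, []):
            links.append((uri, predicate, match))
    return links

# Hypothetical records: the URIs and identifiers are invented for illustration.
trials = {
    "ex:trial/A": "1234567",
    "ex:trial/B": "7654321",
}
citations = {
    "ex:pubmed/1234567": "1234567",
    "ex:pubmed/9999999": "9999999",
}

links = link_by_shared_id(trials, citations, predicate="rdfs:seeAlso")
```

Only `ex:trial/A` shares an identifier with a citation record, so `links` holds a single triple. Indexing the target side first keeps the join linear in the size of both inputs.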

Where no shared identifiers exist, state-of-the-art string and semantic matching techniques were applied for link discovery. Approximate string matching was used to interlink LinkedCT and Diseasome; for instance, "Alzheimer's disease" in LinkedCT was matched with "Alzheimer_disease" in Diseasome. Semantic matching is especially useful for clinical terms because many drugs and diseases have multiple names. Drugs tend to have both generic names and brand names; for example, "Varenicline" has the synonym "Varenicline Tartrate" and the brand names "Champix" and "Chantix".
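The two ideas above can be sketched roughly as follows. This is not the matching algorithm used by LinQuer or Silk: it assumes a simple normalization step followed by Python's standard `difflib.SequenceMatcher`, and it stands in for semantic matching with a plain list of known names per drug. The helper names `normalize`, `similarity`, and `best_similarity` are hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop possessive 's, and turn punctuation/underscores into spaces."""
    name = name.lower()
    name = re.sub(r"'s\b", "", name)
    name = re.sub(r"[^a-z0-9]+", " ", name)
    return name.strip()

def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def best_similarity(known_names, candidate):
    """Score a candidate against every known name of an entity; keep the best."""
    return max(similarity(name, candidate) for name in known_names)

# Disease labels from the text: they differ only in punctuation.
disease_score = similarity("Alzheimer's disease", "Alzheimer_disease")

# Generic name, synonym, and brand names from the text, used as a synonym list.
drug_names = ["Varenicline", "Varenicline Tartrate", "Champix", "Chantix"]
drug_score = best_similarity(drug_names, "CHANTIX")
```

Both scores come out as 1.0 here because the inputs are identical after normalization. In practice a threshold below 1.0 would be needed to tolerate spelling variants, and its value would have to be tuned against the data.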

Semantic link discovery in this project is performed using the following novel link discovery tools:

  • LinQuer [1] is a novel tool for semantic link discovery over relational data. The LinQuer framework is built around LinQL, a declarative language for specifying linkage requirements in a wide variety of applications. The framework rewrites LinQL queries into standard SQL queries that can be run over existing relational data sources. LinQuer is particularly useful because most of our data is published using tools that operate over relational data sources (such as D2R Server). LinQuer supports semantic link discovery based on state-of-the-art string and semantic matching techniques and their combinations.
  • Silk [2] discovers links between data sources. It provides a declarative language for specifying the link types and conditions. The implemented similarity metrics include string, numeric, date, URI, and set comparison methods as well as a taxonomic matcher that calculates the semantic distance between two concepts within a concept hierarchy. Each metric evaluates to a similarity value between 0 and 1 (higher values indicate greater similarity). Metric results can be weighted and combined into an overall similarity value.
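To illustrate how per-metric similarity values in the range 0 to 1 can be weighted and combined into an overall value, here is a minimal sketch that uses a weighted average. The weighted average is an assumption made for illustration, as are the metric names, weights, and acceptance threshold; this is not Silk's specification language or its implementation.

```python
def weighted_similarity(scores, weights):
    """Weighted average of per-metric similarity values, each in [0, 1]."""
    if not scores:
        raise ValueError("at least one metric score is required")
    for metric, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {metric!r} is outside [0, 1]: {value}")
    total_weight = sum(weights[metric] for metric in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[metric] * scores[metric] for metric in scores)
    return weighted_sum / total_weight

def accept_link(scores, weights, threshold=0.8):
    """Accept a candidate link when the overall similarity reaches the threshold."""
    return weighted_similarity(scores, weights) >= threshold

# Hypothetical metric results for one candidate pair of resources.
scores = {"label": 0.9, "synonym": 1.0, "taxonomy": 0.5}
# Hypothetical weights: the label comparison counts twice as much as the others.
weights = {"label": 2.0, "synonym": 1.0, "taxonomy": 1.0}

overall = weighted_similarity(scores, weights)
accepted = accept_link(scores, weights, threshold=0.8)
```

Because the weights are normalized by their sum, the overall value stays between 0 and 1 like the individual metrics, so the same kind of threshold can be applied to it.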

More on the interlinking methodology and statistics will be made available soon.

[1] Hassanzadeh, O., Xin, R., Miller, R. J., Lim, L., Kementsietsidis, A., and Wang, M.: Linkage Query Writer. To appear in: Proceedings of the 35th International Conference on Very Large Data Bases (VLDB 2009), Demonstrations Track, 2009.

[2] Volz, J., Bizer, C., Gaedke, M., and Kobilarov, G.: Silk – A Link Discovery Framework for the Web of Data. In: Linked Data on the Web workshop at WWW2009, 2009.

Interlinking

The figure below shows the data sets that have been published so far and the interlinking paths between them.

2009-07_lodd_interlinking_by_type.png

Number of outgoing Links

Data Set
DailyMed
DrugBank
DailyMed
LinkedCT
RDF-TCM
SIDER

Linkage types

Source Data Source  Target Data Source  Number of Links
DailyMed            LinkedCT            27,685
DailyMed            LinkedCT            44
DailyMed            DBpedia             49
DailyMed            DBpedia             2,504
DailyMed            Diseasome           6,124
DailyMed            DrugBank            1,593
DailyMed            RDF-TCM             21
Diseasome           DBpedia             1,300
Diseasome           DBpedia             643
Diseasome           GeneID              688
Diseasome           HGNC                688
Diseasome           OMIM                2,929
Diseasome           Symbol              9,743
Diseasome           LinkedCT            372
Diseasome           DailyMed            6,124
Diseasome           DrugBank            8,202
Diseasome           RDF-TCM             313
Diseasome           RDF-TCM             63
DrugBank            ChEBI               736
DrugBank            PDB                 3,379
DrugBank            CAS                 2,240
DrugBank            Pfam                19,082
DrugBank            UniProt             4,660
DrugBank            HGNC                1,675
DrugBank            GeneID
DrugBank            Symbol              1,533
DrugBank            LinkedCT            12,127
DrugBank            DBpedia             187
DrugBank            DBpedia             1,522
DrugBank            Diseasome           8,202
DrugBank            DailyMed            1,593
DrugBank            KEGG                913
DrugBank            KEGG Compound       1,331
DrugBank            RDF-TCM             384
DrugBank            RDF-TCM             1
DrugBank            PubMed              96
LinkedCT            DailyMed            27,685
LinkedCT            DrugBank            12,127
LinkedCT            Diseasome           372
LinkedCT            Geonames            129,177
LinkedCT            DBpedia             8,848
LinkedCT            Yago
LinkedCT            PubMed              42,219
LinkedCT            RDF-TCM             141
RDF-TCM             DBpedia             649
RDF-TCM             DBpedia             496
RDF-TCM             DBpedia             255
RDF-TCM             SIDER               171
RDF-TCM             Diseasome           313
RDF-TCM             Diseasome           63
RDF-TCM             DrugBank            1
RDF-TCM             DrugBank            384
RDF-TCM             EntrezGene          944
RDF-TCM             DailyMed            21
RDF-TCM             LinkedCT            141
SIDER               RDF-TCM             171
SIDER               DrugBank            1,140
SIDER               DailyMed            1,986
SIDER               Diseasome           238
SIDER               DBpedia             1,392
SIDER               DBpedia             735
SIDER               STITCH              14,894
STITCH              DBpedia             123

Metadata about Interlinking