HCLSIG BioRDF Subgroup/Brainstorming

From W3C Wiki

Brainstorming

Please suggest ideas for work that BioRDF could undertake.

OBO Structured Digital Abstracts

(added by Matthias Samwald, this is actually a long-term project, but we could lay the foundations within the next few months)

Over the last year, several independent groups have developed an interest in Structured Digital Abstracts (SDAs). An SDA is a simple, machine-readable representation of the facts found in a journal article or a database submission. The idea is that this annotation becomes part of the scientific publication process, e.g., that authors create the SDA for their articles themselves -- they know the content much better than any curator or automated NLP routine. In addition, annotations can also be created by other researchers, e.g., as a by-product of personal knowledge management (scientific bookmarking/tagging services etc.), or by automated natural language processing. Ideally, all of the annotation sources (human data/text creator, human data/text consumer, automated NLP) would work together in a synergistic manner.

Related ongoing projects:

http://sdabstract.org/

http://wiki.gersteinlab.org/sda/index.php/Main_Page

http://www.cashewprize.org/

RDF/OWL together with OBO ontologies, the simple relations defined in the OBO relation ontology and stable URIs for biomedical databases are a perfect fit for this task. Depending on the motivation and skills of the annotator, the annotations could range from the very simple to the very intricate. In the simplest scenario, the annotations would just consist of simple 'tagging' with terms from OBO-in-OWL ontologies and URIs of database resources (e.g. sequence records from Uniprot-RDF). In the most detailed case, the annotations could consist of the description of processes (e.g., protein binding, pathways, physiology), structures (e.g., anatomy) and qualities (e.g., phenotypes), based on the simple relations defined in the OBO relation ontology. In either case, the annotations should be based on the widely accepted OBO resources. Furthermore, the descriptions should be only qualitative and not quantitative (e.g., without numeric values etc.).
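As a rough illustration of the simplest 'tagging' scenario described above, the following Python sketch writes a few annotations as N-Triples lines. The article URI is a placeholder, and the choice of dcterms:subject as the tagging property is an assumption made here, not something the proposal specifies. The two term URIs only illustrate the kind of OBO and UniProt identifiers that could be used.

```python
# Minimal sketch of 'simple tagging' for a Structured Digital Abstract.
# All identifiers below are illustrative assumptions, not part of the proposal.
ARTICLE = "http://example.org/article/123"            # hypothetical article URI
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"  # assumed tagging property


def tag_article(article_uri, term_uris):
    """Return one N-Triples line per (article, dcterms:subject, term) statement."""
    return [
        f"<{article_uri}> <{DCTERMS_SUBJECT}> <{term}> ."
        for term in term_uris
    ]


tags = [
    "http://purl.obolibrary.org/obo/GO_0005515",  # a Gene Ontology term (protein binding)
    "http://purl.uniprot.org/uniprot/P05067",     # a UniProt sequence record
]

sda = tag_article(ARTICLE, tags)
for line in sda:
    print(line)
```

A more detailed annotation would replace the single tagging property with relations from the OBO relation ontology, but the overall shape (subject, relation, object) stays the same.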

The project would require us to

  • clearly define the ontological representation
  • create a web-based user interface for the creation of such OBO annotations (possibly based on an NLP backend for automated entity suggestion, e.g. the Whatizit text extraction service of EBI that has already been used for the Science Commons text annotation service)
  • network with scientific publishing houses (Elsevier, Nature?) and database providers (Uniprot, National Library of Medicine?)
  • build an application to aggregate, query and visualize the distributed annotations (possibly based on DERI Sindice?)

Some potential participants with valuable expertise are: DERI Galway, Science Commons, Structured Digital Abstracts consortium (Stanford), Center for Medical Informatics and Gerstein Group (Yale), Computational Biology Lab at the Memorial Sloan-Kettering Cancer Center, Semantic Web Company, Medical University of Vienna.

OntoWiki and Semantic MediaWiki

(added by Matthias Samwald)

Semantic Web Wikis will be very important for making Semantic Web content accessible to end-users. OntoWiki and Semantic MediaWiki are the two most mature semantic wikis currently available. We need to explore how they can be used for the HCLS sector, and where there is still room for improvement.

FDA2RDF

Would like to see some of the information available from the FDA converted to RDF if this hasn't been done previously. For example:

Orange Book - http://www.fda.gov/cder/ob/default.htm

Mike Bevil (mike_bevil@merck.com)

HCLS KB Usability

While formulating SPARQL queries is indeed a powerful approach to querying the HCLS KB, it remains well outside the ability of the typical biomedical scientist. Web-based user interfaces are required that facilitate query construction and completely hide the details of querying complex knowledge. Here are some ideas:

- Ajax-based inline suggestion box using Manchester OWL syntax - see SMART

- Query composition "wizard" - see Biozon

- Use of Exhibit - see Clinical Demo - the caveat being that we will need to first dynamically create the content for visualization.

Michel Dumontier (michel_dumontier@carleton.ca)

HCLS KB Decentralisation

(added by Matthias Samwald)

We should try to transform the HCLS KB into an infrastructure of distributed SPARQL endpoints and/or linked data, administered by the data providers. This is a more realistic approximation of the future structure of the HCLS Semantic Web, and it is also easier to maintain in the long term.

HCLS KB mapping to the open linked data repositories

(added by Matthias Samwald)

It would be great if the two largest coherent Semantic Web structures, namely the HCLS Knowledge Base and the Linking Open Data (LOD) datasets, could be mapped to each other. Some possible anchors in the LOD datasets:

  • DBpedia (besides label-matching, we could also make use of information like CAS number, Uniprot/Pubmed references etc. that are part of DBpedia)
  • YAGO
  • W3C Wordnet
  • OpenCyc 1.0

Some possible anchors in the HCLS KB:

  • All of the OBO ontologies
  • MeSH in SKOS
  • NeuronDB

Since DBpedia is already mapped to most other datasets in the LOD collection, DBpedia would probably be our primary target for such mappings. Since the ontological foundation of most LOD datasets is relatively loose compared to most HCLS datasets, the mapping should not be done with owl:sameAs or owl:equivalentClass statements, but rather with softer statements such as rdfs:seeAlso. The only exception is OpenCyc, where a more stringent mapping with owl:equivalentClass is possible. A mapping between the SKOS version of MeSH and the SKOS version of Wikipedia categories could also receive special attention: we could use owl:sameAs (although this is discouraged by the SKOS specification) or the specialized SKOS mapping vocabulary.
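The mapping policy described above (soft rdfs:seeAlso links by default, owl:equivalentClass only for OpenCyc) could be sketched roughly as follows in Python. The HCLS-side and OpenCyc-side URIs and the dataset labels are hypothetical placeholders. Only the rdfs:seeAlso and owl:equivalentClass property URIs are standard.

```python
# Minimal sketch of the 'soft by default, strict for OpenCyc' mapping policy.
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"

# Datasets considered rigorous enough for a strict mapping (only OpenCyc, per the text).
STRICT_TARGETS = {"opencyc"}


def mapping_predicate(target_dataset):
    """Pick the mapping property according to the target dataset."""
    if target_dataset.lower() in STRICT_TARGETS:
        return OWL_EQUIVALENT_CLASS
    return RDFS_SEE_ALSO


def mapping_triple(hcls_uri, target_uri, target_dataset):
    """Return one N-Triples line linking an HCLS KB entity to a LOD entity."""
    predicate = mapping_predicate(target_dataset)
    return f"<{hcls_uri}> <{predicate}> <{target_uri}> ."


# Hypothetical HCLS KB class URI; the target URIs are illustrative.
hcls_class = "http://example.org/hcls/Dopamine"
soft = mapping_triple(hcls_class, "http://dbpedia.org/resource/Dopamine", "DBpedia")
strict = mapping_triple(hcls_class, "http://example.org/opencyc/Dopamine", "OpenCyc")
print(soft)
print(strict)
```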

Furthermore, we should

Even if we decide that adhering to the linked data practices is not of special interest to us, a mapping to the LOD datasets would still be very valuable in itself.

Convert GENIA event corpus to RDF

The Tsujii lab has just released the GENIA event corpus (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi?page=Event+Annotation), which is described in this paper (http://www.biomedcentral.com/1471-2105/9/10). It would be great if this corpus were accessible as a semantic web resource for the HCLSIG community.

Bring chemical structures into BioRDF

We have InChI strings as URIs for chemical structures, but the other ways of representing chemical structures (ChemBLAST, CO, ChEBI, MQL) need to be aligned and made consistent.

Colin Batchelor (batchelorc@rsc.org)

Rule Responder HCLS

The goal of Rule Responder HCLS is to provide a flexible and distributed eScience rule inference infrastructure in the domain of Health Care and Life Science. It enables distributed deployment of rule responder services on the Web. These services implement declarative rule-based decision logic and semi-automated reaction logic on top of existing Web-based scientific services/tools and data sources, e.g. by dynamically accessing the W3C HCLS KB via its SPARQL web interface and making rule-based decisions on the basis of the queried facts.

A particular use case, identifying experts in Alzheimer's disease research, has been implemented by dynamically integrating the UniProt beta RDF, GoPubMed Statistics, EMBL-EBI Patent Abstracts and the W3C HCLS KB SPARQL service:

http://ibis.in.tum.de/projects/paw/hcls/

Some quick thoughts on further rule-based applications (with the idea of bringing together the work done in the W3C HCLS RDF and the W3C HCLS COI group):

- Medical Decision Support: http://www.w3.org/2005/rules/wiki/UCR#Ruleset_Integration_for_Medical_Decision_Support

- human resource management, e.g. find experts in a particular field of research or find the most appropriate available staff (the rule-based logic would need to be flexible and handle exceptions to the standard norms, e.g. an exceptional case in a clinical process flow)

- personalized rule-based information agent or information dash board performing automated (re-)actions or informing the recipient about situations of interest, e.g. inform a drug designer about a significant increase in recent publications of protein-based drugs for a particular disease.

- automatically generate vaccination calendars

- select patients for clinical trials considering rule-based inclusion/exclusion criteria (see W3C HCLS COI use case)

- assemble menus for hospitalized patients and prepare orders for the ingredients according to disease, patient info, drugs, allergies, …

- identify disease from patient symptoms and patient history

Adrian Paschke (adrian.paschke@biotec.tu-dresden.de)

Semantic Web Portal for Neuroscientific Data Mashup

As the need for neuroscience data integration grows and the number of neuroscientific datasets available in RDF/OWL format increases, we are at the neuroinformatics frontier of exploring the full potential of Semantic Web technologies for integrative neuroscience research (including translational research), which requires integrating diverse types of neuroscientific data provided by different sources in heterogeneous formats. As a pilot project, we have prototyped a user-friendly Web application called “Entrez Neuron” that allows the user to perform keyword searches for neuron-related information across multiple data sources (OWL ontologies), including the SenseLab (NeuronDB and ModelDB) ontologies and the CCDB (Cell Centered Database) ontologies. Our plan is to expand this pilot application into a Semantic Web Portal that allows semantic mashup of diverse types of data in neuroscientifically meaningful ways. To achieve this, we propose to include additional data/ontology sources and to create different facets that represent user-centric aspects (or views) of ontologies. For example, “Entrez Neuron” currently features a “brain region/neuron” facet for organizing query results (about neurons) based on anatomical structure. Such a hierarchical structure is intuitive to neuroscientists, while benefiting from the machine use of ontologies in terms of querying, organizing, and integrating data. Additional facets that we are considering include (but are not limited to) drug, disease, pathway, phenotype, gene functions, etc.

Project members: Kei Cheung, Matthias Samwald, Ernest Lim, Huajun Chen, Pradeep Mutalik, Luis Marenco

SIOC for Science

(added by Matthias Samwald)

Goal: Exploring the adaptation of SIOC for the representation of scientific discussion. Some of the creators of the SIOC specification are interested in extending SIOC for such use cases. I created a prototypical ontology and a small demo scenario. Please note that 'SIOC for Science' should not aim to be redundant with more sophisticated ontologies for the representation of scientific discourse, such as SWAN or SALT. Rather, it should be only a simple extension of SIOC, with greater emphasis on the integration in distributed internet and intranet communities (blogs, mailing lists, bulletin boards etc.).

Potentially interested parties: DERI Galway, Semantic Web Company

Define a simplified OWL-to-RDF mapping for a subset of OWL

(added by Matthias Samwald)

This would probably be more in scope of an OWL task force than of the HCLS interest group. However, it seems that this issue is of special significance for many HCLS ontologies. Many of the current ontologies in the life sciences consist of a large number of classes and OWL property restrictions that encode relations between these classes. This is mostly caused by the fact that many of the entities we are dealing with ARE classes in reality. Several solutions have been proposed, e.g. the use of extended versions of SPARQL ('SPARQL-DL'). However, these are not viable solutions for the immediate future. Establishing a new, widely accepted standard for an enhanced SPARQL syntax and query engine would take a long time, and it would take even longer until it found widespread implementation in triplestores. The use of reasoners for querying can cause performance problems, is more restricted than SPARQL, and could not be implemented in all use cases (e.g., restricted access to the server). It would be very useful to have an alternative, simplified RDF representation for these ontologies, so they can be easily queried with standard RDF query languages like SPARQL. For example, one transformation step could transform "S has P some O" class restrictions into direct relations between the two classes ("S P O"). Many OWL expressions such as cardinality restrictions could not be represented with this RDF mapping -- however, such information is often not required at query time.
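The transformation step mentioned above, turning "S has P some O" restrictions into direct "S P O" relations, could look roughly like the following Python sketch. It assumes the ontology is already available as a list of (subject, predicate, object) tuples in the usual RDF encoding of OWL restrictions. Prefixed names are used as plain strings for brevity, and the ex: classes and properties are made up for illustration.

```python
# Minimal sketch: flatten 'S subClassOf (P some O)' into a direct triple (S, P, O).
# Triples are plain tuples of prefixed names; the 'ex:' terms are hypothetical.

def flatten_some_restrictions(triples):
    """Return direct (S, P, O) triples for every existential restriction on S."""
    restrictions = set()     # blank nodes typed as owl:Restriction
    on_property = {}         # restriction node -> property P
    some_values_from = {}    # restriction node -> filler class O
    for s, p, o in triples:
        if p == "rdf:type" and o == "owl:Restriction":
            restrictions.add(s)
        elif p == "owl:onProperty":
            on_property[s] = o
        elif p == "owl:someValuesFrom":
            some_values_from[s] = o

    direct = []
    for s, p, o in triples:
        # Only existential restrictions are mapped; anything else is skipped.
        if (p == "rdfs:subClassOf" and o in restrictions
                and o in on_property and o in some_values_from):
            direct.append((s, on_property[o], some_values_from[o]))
    return direct


ontology = [
    # ex:Nucleus subClassOf (ex:part_of some ex:Cell)
    ("ex:Nucleus", "rdfs:subClassOf", "_:r1"),
    ("_:r1", "rdf:type", "owl:Restriction"),
    ("_:r1", "owl:onProperty", "ex:part_of"),
    ("_:r1", "owl:someValuesFrom", "ex:Cell"),
    # ex:Cell subClassOf (ex:has_part min 1): a cardinality restriction
    ("ex:Cell", "rdfs:subClassOf", "_:r2"),
    ("_:r2", "rdf:type", "owl:Restriction"),
    ("_:r2", "owl:onProperty", "ex:has_part"),
    ("_:r2", "owl:minCardinality", "1"),
]

flat = flatten_some_restrictions(ontology)
print(flat)
```

The cardinality restriction in the sample data produces no direct triple, which reflects the limitation noted above: only the existential pattern survives the mapping.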

Using Current Terminologies

Proposed by Eric Neumann

Consider how to include currently used terminologies such as those from UMLS. I don't know if Olivier Bodenreider has been on any of the calls recently, but his offer to mint URIs from CUIs (UMLS, MeSH, etc.) is something that would be of immense value for any of the listed projects. This would help establish some examples of true linked bio-data!

Starting with MeSH might be a good idea, as it does not have the intellectual property restrictions inherent to other sources in the UMLS. (Olivier Bodenreider)
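To make the URI-minting idea above a little more concrete, here is a minimal Python sketch. The base namespace is a hypothetical placeholder (no official namespace is implied), and the check assumes the familiar CUI shape of a "C" followed by seven digits.

```python
import re

# Hypothetical namespace; the real one would be chosen by whoever mints the URIs.
BASE = "http://example.org/umls/cui/"

# Assumed CUI shape: the letter 'C' followed by seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def mint_cui_uri(cui, base=BASE):
    """Return a URI for a UMLS concept identifier, rejecting malformed input."""
    if not CUI_PATTERN.fullmatch(cui):
        raise ValueError(f"not a well-formed CUI: {cui!r}")
    return base + cui


uri = mint_cui_uri("C0002395")  # example CUI-shaped identifier
print(uri)
```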

Linking traditional medicine to HCLS KB

(Proposed by Huajun Chen, Kei Cheung, Matthias Samwald)

The term traditional medicine (indigenous medicine or folk medicine) describes medical knowledge systems that developed over centuries within various societies before the era of modern medicine. A typical example is TCM (Traditional Chinese Medicine), an ancient medical system that accounts for around 40% of all health care delivered in China.

This project aims to explore the potential value of combining traditional medicine and modern medicine synergistically at the data level. The first task will focus on integrating herbal data resources from current TCM databases into the HCLS KB, with the aim of mining ethnopharmacological knowledge to discover new pharmacological lead compounds.

With its extensive knowledge of herbal medicine, acupuncture, massage, etc., we believe traditional medicine can serve as a valuable knowledge base that may complement modern approaches to drug discovery and clinical treatment. For example, the drug Huperzine for Alzheimer's Disease was derived from the Chinese herb firmoss (Huperzia serrata). In another example, TCM has been extensively utilized in the treatment of Wilson disease, in combination with western medical approaches.

The key tasks include:

  • Identify key traditional medicine data resources and develop an RDF/OWL-based representation model that matches the HCLS KB model.
  • Develop/adopt use cases and best practices to illustrate the value of combining eastern and western medicine synergistically.
  • Develop methods and tools in support of mining ethnopharmacological knowledge to discover new pharmacological lead compounds.
  • Explore issues relevant to cross-language ontology mapping and cross-cultural information retrieval.