TaskForces/CommunityProjects/LinkingOpenData/SemanticWebSearchEngines/BOF Meeting At WWW2008

From W3C Wiki

These are a few notes on a BOF meeting that took place at WWW 2008 on the topic of "Semantic Web Search Engines and their Applications". This page does not constitute minutes of the meeting; it gives general notes on the topics that were discussed. Other pages will be set up with additional content concerning these topics.

Generalities

The goal of the meeting was to gather developers and users of semantic web search engines to discuss the variety of existing systems, existing and required functionalities, collaborations, standardization, and generally the evolution of semantic web search engines.

Starting point

The following questions were set as a starting point for discussion:

  • What are the current semantic web search engines?
  • How are they different?
  • What is the overlap? Is there a common base?
  • Could effort be combined?
  • What are the applications, both existing and future?
  • What are the remaining gaps? What is not covered?
  • How do we evaluate semantic web search engines?
  • Is the number of documents/entities/triples enough?
  • Are we actually all counting in the same way?
  • Should semantic web search engines be more open than classical web search engines?
  • Working together instead of competing?

Outcome

The major outcome of the meeting is the creation of an informal and extensible interest group concerning semantic web search engines and their applications.

Concretely, this is reflected by the creation of a mailing list including everyone present at the meeting, plus other interested people, and by the creation of a set of wiki pages concerning different sub-topics of semantic web search engines. The starting point for these pages is the existing page on the ESW/LOD project wiki.

Present

Mathieu d'Aquin (Watson), Christian Becker, Chris Bizer, Uldis Bojars, Ivan Bedini, Hugh Glaser, Giovanni Tummarello (Sindice), Weiyi Ge (Falcons), Gong Cheng (Falcons), Ivan Mikhailov, Orri Erling, Yrjana Rankka

Survey on existing Semantic Web Search Engines

The number of semantic web search engines has increased rapidly recently, and one important activity is to provide an overview of these systems. It is clear that different systems have different focuses and strengths. In particular, Sindice targets scalability, Watson puts more emphasis on ontologies and on the level of service it provides to applications, and Falcons puts more effort into user interfaces for searching semantic web data.

Comparing and understanding these systems is difficult, as there are no clear criteria for doing so. The simplest, and nevertheless most widely used, measure is the number of semantic web documents each system has collected. However, this measure is misleading, since different systems do not define a semantic web document in the same way and handle duplication differently. The number of triples was proposed as an alternative, but it suffers from essentially the same issues and is in any case insufficient to assess systems as complex as semantic web search engines.
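To illustrate why raw counts are hard to compare, here is a minimal sketch. The crawled documents, URLs, and triples are invented for illustration, and the two counting policies are only examples of choices an engine might make; they do not describe how any of the systems above actually count.

```python
# Triples are modelled as plain (subject, predicate, object) tuples.
# All URLs and triples below are made up for illustration.
crawled = {
    "http://example.org/a.rdf": [
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:alice", "foaf:name", "Alice"),
    ],
    # The same content published under a second URL (a mirror).
    "http://mirror.example.org/a.rdf": [
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:alice", "foaf:name", "Alice"),
    ],
    "http://example.org/b.rdf": [
        ("ex:bob", "foaf:name", "Bob"),
        ("ex:alice", "foaf:name", "Alice"),  # also stated in a.rdf
    ],
}


def count_raw(docs):
    """Count every URL as a document and every statement as a triple."""
    return len(docs), sum(len(triples) for triples in docs.values())


def count_deduplicated(docs):
    """Count documents with identical content once, and each distinct triple once."""
    distinct_docs = {frozenset(triples) for triples in docs.values()}
    distinct_triples = set().union(*docs.values())
    return len(distinct_docs), len(distinct_triples)


print(count_raw(crawled))           # (documents, triples) under the raw policy
print(count_deduplicated(crawled))  # (documents, triples) after deduplication
```

On this toy crawl the raw policy reports 3 documents and 6 triples, while the deduplicating policy reports 2 documents and 3 triples, even though both describe the same index.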

There is a need for a survey on semantic web search engines where existing systems are assessed according to different aspects, using a variety of criteria. Starting from the existing list of search engines, a wiki page with an initial structure for such a survey is set up.

Gathering user needs

To guide the evolution of semantic web search engines, it is important to have a broad overview of what is required and of the tasks in which these systems are, or can be, used.

For example, it appears that for some users, semantic web search engines should not only locate semantic web data, but also provide an access point (SPARQL endpoint, wrapper service, etc.) through which this data can be queried. Some systems already partly respond to this need (for example, Watson and SWSE provide their own SPARQL endpoints).
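As a rough sketch of what "providing the access point" could mean for a client, the snippet below builds a SPARQL Protocol GET request URL, in which the query is passed in the `query` parameter. The endpoint address is hypothetical and is not the address of Watson, SWSE, or any other real service.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
ENDPOINT = "http://search.example.org/sparql"

QUERY = "SELECT ?s WHERE { ?s a <http://xmlns.com/foaf/0.1/Person> } LIMIT 10"


def sparql_get_url(endpoint, query):
    """Build the URL of a SPARQL Protocol query sent via HTTP GET."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

A client could then fetch this URL with any HTTP library and request a result format through the `Accept` header.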

As a starting point, a wiki page is set up for collecting existing or potential applications of semantic web search engines.

Standardization, common interfaces, APIs

To reduce the issues related to the diversity of systems, a discussion should take place on establishing common interfaces and APIs, so that different semantic web search engines can be accessed in a homogeneous way.

In particular, a starting point is the interface used to ping a search engine, that is, to submit a document to it. Sindice adopts an API similar to that of PingTheSemanticWeb.com, and other search engines are encouraged to do the same.
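The benefit of a shared ping interface can be sketched as follows: if every engine accepts the same request shape, a publisher can notify all of them with one loop. The service addresses and the `url` parameter name below are assumptions made for illustration; they are not the documented API of Sindice or PingTheSemanticWeb.com.

```python
from urllib.parse import urlencode

# Hypothetical ping services that are assumed to share one interface.
PING_SERVICES = [
    "http://engine-one.example.org/ping",
    "http://engine-two.example.org/ping",
]


def ping_urls(document_url, services=PING_SERVICES):
    """Return one ping request URL per service for the given document."""
    # Assumption: each service takes the document address in a "url" parameter.
    return [service + "?" + urlencode({"url": document_url}) for service in services]


requests_to_send = ping_urls("http://example.org/foaf.rdf")
for request_url in requests_to_send:
    print(request_url)
```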

Standardized interfaces and APIs should also be considered for accessing and searching with semantic web search engines. A starting point for that could be Open Search.
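Open Search describes a search service through a URL template such as `...?q={searchTerms}`, where a trailing `?` inside the braces marks a parameter as optional. The sketch below fills in such a template; the template itself is a hypothetical example, and the function covers only this simple substitution, not the full specification.

```python
import re
from urllib.parse import quote

# Hypothetical template, as it might appear in an Open Search description.
TEMPLATE = "http://search.example.org/search?q={searchTerms}&page={startPage?}"


def fill_template(template, **params):
    """Substitute {name} and {name?} placeholders in a URL template."""

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            return quote(str(params[name]), safe="")
        if optional:
            return ""  # optional parameters without a value become empty
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{([A-Za-z:]+)(\?)?\}", substitute, template)


print(fill_template(TEMPLATE, searchTerms="foaf Person"))
print(fill_template(TEMPLATE, searchTerms="foaf Person", startPage=2))
```

With a shared template format, a client needs only the description document of each engine to query all of them in the same way.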

Finally, an important element of discussion concerns semantic sitemaps, a format through which a site declares to search engines the data it makes available and the existing access points. Currently, only Sindice processes this format, and very few sites contain a valid sitemap. Systems are encouraged to use and process semantic sitemaps.
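As an illustration of how an engine might read the access points declared in a semantic sitemap, here is a minimal sketch. The namespace URI and the element names (`sc:dataset`, `sc:datasetLabel`, `sc:sparqlEndpointLocation`, `sc:dataDumpLocation`) are given as best recalled from the semantic sitemap extension and should be checked against the specification; the sitemap content itself is invented.

```python
import xml.etree.ElementTree as ET

# Namespace of the semantic sitemap extension (assumption, to be verified).
SC = "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd"

# Invented example sitemap.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Example dataset</sc:datasetLabel>
    <sc:sparqlEndpointLocation>http://example.org/sparql</sc:sparqlEndpointLocation>
    <sc:dataDumpLocation>http://example.org/dump.rdf.gz</sc:dataDumpLocation>
  </sc:dataset>
</urlset>
"""


def access_points(sitemap_xml):
    """List the label, SPARQL endpoints, and dumps of each declared dataset."""
    ns = {"sc": SC}
    root = ET.fromstring(sitemap_xml)
    datasets = []
    for dataset in root.findall("sc:dataset", ns):
        datasets.append({
            "label": dataset.findtext("sc:datasetLabel", default="", namespaces=ns),
            "sparql": [e.text for e in dataset.findall("sc:sparqlEndpointLocation", ns)],
            "dumps": [e.text for e in dataset.findall("sc:dataDumpLocation", ns)],
        })
    return datasets


print(access_points(SITEMAP))
```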

Ranking

As part of the surveying activity, it is useful to know what kind of ranking mechanism each search engine applies, to help users and applications select semantic data. Also, as part of the user requirements gathering, it is important to know what users require in this respect.

  • Sindice: basically, some doc/term frequency analysis
  • Falcons: doc/term frequency
  • Watson: based on complexity (use of language primitives) + size. Working on a flexible/customizable ranking based on quality metrics and user evaluations.
  • Swoogle: Popularity/PageRank like algorithm
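
As a point of reference for the "doc/term frequency" entries above, the following is a minimal TF-IDF sketch. It is a textbook formulation applied to invented data, not the ranking actually implemented by Sindice or Falcons, whose details were not discussed at the meeting.

```python
import math
from collections import Counter

# Invented documents, each reduced to the list of terms it contains.
DOCS = {
    "a.rdf": ["person", "name", "knows", "person"],
    "b.rdf": ["person", "name", "project", "name"],
    "c.rdf": ["project", "homepage", "license", "project"],
}


def tf_idf_rank(query_terms, docs):
    """Return document names ordered by decreasing TF-IDF score for the query."""
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    scores = {}
    for name, terms in docs.items():
        term_freq = Counter(terms)
        length = len(terms) or 1  # avoid dividing by zero for empty documents
        scores[name] = sum(
            (term_freq[t] / length) * math.log(n_docs / doc_freq[t])
            for t in query_terms
            if doc_freq[t]  # terms absent from the collection contribute nothing
        )
    return sorted(scores, key=scores.get, reverse=True)


print(tf_idf_rank(["project"], DOCS))
```

Here `c.rdf` ranks first for the query "project" because the term makes up a larger share of that document than of `b.rdf`, and `a.rdf` does not contain it at all.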