RDF Store Benchmarking

This page collects references to RDF benchmarks, benchmarking results and papers about RDF benchmarking.

At the end of the page, we collect use cases for RDF benchmarking and offer ideas for future discussion on benchmarking triple stores.

RDF Benchmarks

Benchmarking Results

Results provided by store implementers themselves:

Results provided by third parties:

Publications about RDF Benchmarking

Blog Posts about RDF Benchmarking

Large real-world datasets that could be used for benchmarking

SPARQL Compliance

SPARQL Implementation Coverage Report (results of running the DAWG SPARQL test cases against different RDF stores)

Workshops and Events

Use Cases and Future ideas

Use Cases

Benchmarking Triple Stores

An RDF benchmark suite should meet the following criteria:

The query load should illustrate the following types of operations:

If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries per second metric, we can define the mix similarly to TPC C: we count executions of the main query and divide by the running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.
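The metric described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the query names and the per-ten ratios are made-up assumptions, not part of any agreed test spec.

```python
from dataclasses import dataclass


@dataclass
class MixResult:
    main_executions: int      # completed runs of the main query
    elapsed_seconds: float    # wall-clock running time of the test

    def queries_per_second(self) -> float:
        # Only the main query is counted; secondary queries are
        # required work that does not add to the score.
        return self.main_executions / self.elapsed_seconds


def secondary_quota(main_executions: int, per_ten_main: dict) -> dict:
    """Runs of each secondary query the mix requires so far.

    per_ten_main maps a (hypothetical) secondary query name to the
    number of runs required per 10 executions of the main query.
    """
    completed_tens = main_executions // 10
    return {name: completed_tens * n for name, n in per_ten_main.items()}


# Assumed ratios, for illustration only.
ratios = {"friends_of_friends": 2, "tag_search": 1}
quota = secondary_quota(25, ratios)
result = MixResult(main_executions=500, elapsed_seconds=20.0)
```

A driver would compare the quota against the secondary queries actually executed and reject a run that under-delivers on them, so that the score cannot be raised by skipping the expensive queries.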

Full Disclosure Report

The report contains basic TPC-like items such as:

These can go into a summary spreadsheet that is just like the TPC ones.

Additionally, the full report should include:

Test Drivers

OpenLink has a multithreaded C program that simulates n web users multiplexed over m threads, for example 10000 users over 100 threads. Each user has its own state, so the users carry out their respective usage patterns independently and are served as soon as the server is available, while no more than m requests are in flight at any time. A usage pattern is something like: check the mail, browse the catalogue, add to the shopping cart, and so on. This can be modified to browse a social network database and produce the desired query mix. The program generates HTTP requests, so it would work against a SPARQL endpoint or any set of dynamic web pages.

The program produces a running report of the clicks per second rate and statistics at the end, listing the min/avg/max times per operation.
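To make the design concrete, here is a minimal Python sketch of the same idea. It is not OpenLink's C program; the function name `run_driver`, the `send` callback and the report layout are assumptions made for illustration. The request itself is injected as a callable, so the sketch can be pointed at an HTTP client or at a stub.

```python
import queue
import threading
import time
from collections import defaultdict


def run_driver(n_users, m_threads, pattern, send):
    """Run n_users through `pattern` using at most m_threads requests at once.

    pattern: list of operation names each user performs in order.
    send(user_id, op): hypothetical callable that performs one request.
    """
    ready = queue.Queue()
    if pattern:
        for uid in range(n_users):
            ready.put((uid, 0))  # per-user state: index of the next operation
    timings = defaultdict(list)
    lock = threading.Lock()

    def worker():
        while True:
            try:
                uid, step = ready.get_nowait()
            except queue.Empty:
                return  # no user is waiting; this thread retires
            op = pattern[step]
            start = time.perf_counter()
            send(uid, op)
            elapsed = time.perf_counter() - start
            with lock:
                timings[op].append(elapsed)
            if step + 1 < len(pattern):
                ready.put((uid, step + 1))  # user waits for its next turn

    threads = [threading.Thread(target=worker) for _ in range(m_threads)]
    started = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total_time = time.perf_counter() - started

    requests = sum(len(v) for v in timings.values())
    per_operation = {
        op: {"count": len(v), "min": min(v), "avg": sum(v) / len(v), "max": max(v)}
        for op, v in timings.items()
    }
    return {
        "requests": requests,
        "clicks_per_second": requests / total_time if total_time > 0 else 0.0,
        "per_operation": per_operation,
    }
```

Because each worker thread handles one request at a time, the number of threads is the cap on concurrent requests. The sketch only reports statistics at the end; a running clicks per second report would need a periodic sampler on top of it.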

This can be packaged as a separate open source download once the test spec is agreed upon.

For generating test data, a modification of the LUBM generator is probably the most convenient choice.

Benchmarking Relational to RDF Mapping

This area is somewhat more complex than triple storage.

At least the following factors enter into the evaluation:

The rationale for mapping relational data to RDF is often data integration. Even in simple cases like OpenLink's ODS applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.

A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.

A real world case is OpenLink's ongoing work for mapping WordPress, MediaWiki, phpBB and possibly other popular web applications into SIOC.

Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems and the test driver might even be the same as with the above proposed triple store benchmark. The query mix might have to be somewhat tailored.

Another "enterprise style" scenario might be to take the TPC C and TPC D databases, both of which have products, customers and orders, and map them into a common ontology. Queries could then run sometimes on only one database and sometimes join both.

Considering the times and the audience, the WordPress/MediaWiki scenario might be culturally more interesting and more fun to demo.

The test has two aspects: throughput and coverage. I think these should be measured separately.

The throughput can be measured with queries that are generally sensible, such as "get articles by an author that I know with tags t1 and t2."
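As an illustration, such a query could be generated as below. The choice of `foaf:knows`, `foaf:maker` and `sioc:topic` is an assumption about how the mapped data would be modelled, not something fixed by this proposal, and the helper name is hypothetical. The sketch does no IRI escaping.

```python
def articles_by_known_author(me, tags):
    """Build a SPARQL query for articles by someone `me` knows, carrying all `tags`.

    me: IRI of the person asking; tags: list of tag IRIs.
    The vocabulary terms used here are assumed, see the text above.
    """
    tag_patterns = "\n".join(f"  ?article sioc:topic <{t}> ." for t in tags)
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "PREFIX sioc: <http://rdfs.org/sioc/ns#>\n"
        "SELECT DISTINCT ?article WHERE {\n"
        f"  <{me}> foaf:knows ?author .\n"
        "  ?article foaf:maker ?author .\n"
        f"{tag_patterns}\n"
        "}"
    )


query = articles_by_known_author(
    "http://example.org/people/me",
    ["http://example.org/tags/t1", "http://example.org/tags/t2"],
)
```

Every triple pattern here has a fixed predicate, so a mapping layer can resolve each pattern to a small number of tables per source.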

Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known only at run time, or if the graph is not given, we get a union of everything joined with another union of everything. Many of the joins between the terms of the different unions are identically empty, but the software may not know this.
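The blow-up can be shown with a toy model. The mapping table below is entirely hypothetical (three sources, two mapped predicates each), and the pruning rule assumes that subjects from different sources can never be equal, which a real mapping layer would have to prove from its IRI patterns.

```python
from itertools import product

# Hypothetical mapping: predicate -> list of (source, column) it is mapped from.
MAPPING = {
    "sioc:has_creator": [
        ("blog", "posts.author"),
        ("wiki", "pages.editor"),
        ("forum", "messages.poster"),
    ],
    "sioc:topic": [
        ("blog", "post_tags.tag"),
        ("wiki", "page_categories.category"),
        ("forum", "messages.forum_name"),
    ],
}


def branches(predicate=None):
    """Union branches for one triple pattern; None means the predicate is unbound."""
    if predicate is not None:
        return list(MAPPING[predicate])
    return [b for cols in MAPPING.values() for b in cols]


def join_candidates(left, right, prune=False):
    """All branch pairs for a join on the subject of two patterns.

    With prune=True, pairs drawn from different sources are dropped,
    under the assumption that their subjects can never match.
    """
    pairs = product(branches(left), branches(right))
    if prune:
        return [(a, b) for a, b in pairs if a[0] == b[0]]
    return list(pairs)


naive = len(join_candidates(None, None))               # 6 * 6 = 36
pruned = len(join_candidates(None, None, prune=True))  # 3 sources * 2 * 2 = 12
bound = len(join_candidates("sioc:has_creator", "sioc:topic", prune=True))  # 3
```

Even in this tiny model, two patterns with unbound predicates produce 36 join terms, of which 24 are empty by construction. The count grows with the product of the union sizes as more patterns and sources are added.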

In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like "list all predicates and objects of everything called gizmo where the predicate is in the product ontology".

It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.

RdfStoreBenchmarking (last edited 2008-05-12 07:56:53 by ChrisBizer)