This page is for references to signed quotes of deployments of large triple stores rather than predictions of what some software might scale to.
Table of Contents:
- BigOWLIM (12B explicit, 20B total)
- Bigdata(R) (12.7B)
- Garlik JXT (9.8B)
- YARS2 (7B)
- OpenLink Virtuoso Open-Source Edition (Scale: simply a case of Cluster Size)
- Jena TDB (1.7B)
- AllegroGraph (1B)
- Jena SDB (650M)
- Mulgara (500M)
- RDF gateway (262M)
- Jena with PostgreSQL (200M)
- Kowari (160M)
- 3store with MySQL 3 (100M)
- Sesame (70M)
BigOWLIM (12B explicit, 20B total)
BigOWLIM 3.1 demonstrated reasoning against 12.03 billion explicit statements, loading the LUBM(90000,0) dataset. Reasoning over this data materialized an additional 8.43 billion implicit statements, bringing the total number of statements stored in the repository to 20.46 billion. Loading, together with inference, took 290 hours on a single server worth less than $10,000. The average speed of loading explicit statements was 11,543 st./sec.; the speed of storage and indexing was 19,603 st./sec.
BigOWLIM can deal with 1 billion statements on a desktop machine worth $2000: it takes less than 5 hours to load the LUBM(8000) dataset at an average speed of 66,196 st./sec.; loading with inference and materialization takes 14 hours (21 KSt/sec.). The "full cycle" run of LUBM(8000), including loading, inference, and query evaluation, takes 15.2 hours.
More details about the benchmark runs are available at http://www.ontotext.com/owlim/benchmarking/lubm.html
Commercial license; freely available for research, evaluation and other non-production usage. http://www.ontotext.com/owlim/big/index.html
Bigdata(R) (12.7B)
We are in a shakedown period on the scale-out system and will post results as we get them.
6/30/2009: 1B triples stable on disk in 50 minutes (333k tps). 12.7B triples loaded. The issue with clients dying off has been resolved, as has the high client CPU utilization issue.
5/25/2009: 10.4 billion LUBM triples loaded in 47 hours (61k tps) on a 15 blade cluster (this run used 9 data servers, 5 clients, and 1 service manager). Max throughput was just above 241k triples per second. 1 billion triples was reached in 71 minutes, 2 billion in 161 minutes, 5 billion in 508 minutes. The clients are still the bottleneck and started failing one by one after 7.8B triples (throughput at that point was 141k tps).
5/22/2009: 9 billion LUBM triples loaded in 31 hours using the same hardware. The bottleneck was the clients, which were not able to put out enough load. By the end of the trial the clients were at 100% utilization while the data services were at less than 10% utilization.
5/21/2009: 5 billion LUBM triples loaded in 10 hours (135k tps) on a 15 blade cluster (10 data servers, 4 clients, 1 service manager).
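The "tps" figures in the entries above are average load rates: triples loaded divided by wall-clock time. A minimal sketch of that arithmetic follows; the function name `triples_per_second` is illustrative and not part of bigdata. It reproduces two of the reported averages, and is not a measurement of anything.

```python
def triples_per_second(triples: float, seconds: float) -> float:
    """Average load rate over a whole run (triples / wall-clock seconds)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return triples / seconds


# 6/30/2009 entry: 1B triples stable on disk in 50 minutes.
rate_1b = triples_per_second(1e9, 50 * 60)

# 5/25/2009 entry: 10.4B triples loaded in 47 hours.
rate_10b = triples_per_second(10.4e9, 47 * 3600)

print(f"{rate_1b / 1e3:.0f}k tps")   # 333k tps
print(f"{rate_10b / 1e3:.0f}k tps")  # 61k tps
```

An average over a long run hides the peak: the 5/25/2009 entry reports a maximum of just above 241k tps against a 61k tps average.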
Bigdata is an open-source general-purpose scale-out storage and computing fabric for ordered data (B+Trees). Scale-out is achieved via dynamic key-range partitioning of the B+Tree indices. Index partitions are split (or joined) based on partition size on disk and moved across data services on a cluster based on server load. The entire system is designed to run on commodity hardware, and additional scale can be achieved by simply plugging in more data services dynamically at runtime, which will self-register with the centralized service manager and start managing data automatically. Much like Google's BigTable, there is no theoretical maximum scale.
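The idea of key-range partitioning with splits can be sketched in a few lines. This is a toy illustration under stated assumptions, not bigdata's implementation: the class `KeyRangeIndex` and its methods are hypothetical, partitions are plain sorted lists rather than B+Trees, the split trigger is a key count rather than size on disk, and joins and load-based moves between data services are left out.

```python
import bisect


class KeyRangeIndex:
    """Toy key-range partitioned index (hypothetical, for illustration only).

    Partition i owns the keys in [starts[i], starts[i + 1]).
    """

    def __init__(self, max_keys_per_partition: int = 4):
        self.max_keys = max_keys_per_partition
        self.starts = [""]      # lower bound of each partition; "" sorts first
        self.partitions = [[]]  # sorted keys held by each partition

    def _locate(self, key: str) -> int:
        # Rightmost partition whose lower bound is <= key.
        return bisect.bisect_right(self.starts, key) - 1

    def insert(self, key: str) -> None:
        i = self._locate(key)
        part = self.partitions[i]
        pos = bisect.bisect_left(part, key)
        if pos < len(part) and part[pos] == key:
            return  # key already present
        part.insert(pos, key)
        if len(part) > self.max_keys:
            self._split(i)

    def _split(self, i: int) -> None:
        # Cut the oversized partition in half; the right half becomes a
        # new partition whose lower bound is its first key.
        part = self.partitions[i]
        mid = len(part) // 2
        left, right = part[:mid], part[mid:]
        self.partitions[i] = left
        self.partitions.insert(i + 1, right)
        self.starts.insert(i + 1, right[0])

    def contains(self, key: str) -> bool:
        part = self.partitions[self._locate(key)]
        pos = bisect.bisect_left(part, key)
        return pos < len(part) and part[pos] == key


idx = KeyRangeIndex(max_keys_per_partition=4)
for n in range(10):
    idx.insert(f"s{n:02d}")
print(len(idx.partitions), idx.starts)
```

Because every partition covers a contiguous key range, a lookup touches exactly one partition, and a split only changes the boundary list. In a clustered setting that property is what would let a partition be handed to another server without rewriting the rest of the index.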
The bigdata RDF store is an application written on top of the bigdata core. The Bigdata RDF store is fully persistent, Sesame 2 compliant, supports SPARQL, and supports RDFS and limited OWL inference. The single-host RDF database is stable and is used at the core of an open-source harvesting system for the intelligence community. We are working towards a release of the scale-out architecture.
Please come see our presentation at the Semantic Technologies Conference in San Jose on June 18th.
More information on bigdata can be found here:
And in this presentation at OSCON 2008:
http://bigdata.sourceforge.net/pubs/bigdata-oscon-7-23-08.pdf
Open source
Garlik JXT (9.8B)
"We have developed from scratch a large-scale clustered RDF ‘quad’ store. This store, called JXT, is distributed across multiple linux boxes and scales to several billion triples. It has been in production at this scale for over a year now and the early teething problems have all been ironed out. We believe it has the capacity to scale to 60 billion triples. It is also SPARQL compliant. We are fortunate to have Sir Tim Berners-Lee on our Advisory Board so it is not a good idea for us to veer too far away from W3C standards!" -- Tom Ilube, CEO of Garlik
The store is called JXT. Currently we have 4 KBs of 1.6-1.7GT each loaded in our production systems. Loading time for one 1.7GT KB is about 8 hours, but it's an interactive process that involves running lots of queries, and doing small inserts.
In testing we've gone to 3.2GT, but we've not used that much in our production environment. It seemed fine though.
Update: as of 2008-05-28 it's running with 9.8B triples in a production cluster to power the DataPatrol application.
Proprietary, not distributed.
YARS2 (7B)
YARS2: A Federated Repository for Querying Graph Structured Data from the Web describes the distributed architecture of the YARS2 quad store, with scalability experiments up to 7 billion synthetically generated statements, LUBM(50000).
Proprietary, not distributed.
OpenLink Virtuoso Open-Source Edition (Scale: simply a case of Cluster Size)
As of June 26, 2009, the LUBM 8000 load speed is 110,500 triples-per-second on a single machine with 2 x Xeon 5410 and 16G RAM. The software is Virtuoso 6 Cluster, set up with 8 partitions. No inference is made. In comparison, Bigdata reports 200K triples-per-second for the first 8000 LUBM universities on a 15 blade box. We expect to do about that much on one new dual Xeon board; we’ll publish this when this is done.
LOD Cloud Cache is a live instance with more than 5 billion triples (and counting) on a 2-blade cluster with 16 share-nothing Virtuoso cluster nodes (processes). Blade configuration: 2 x dual cores, 2 GB RAM, and 4 SATA controllers (hard disks).
The Towards Web-Scale RDF white paper discusses why triple scale is a function of cluster configuration. 100 billion triples with sub-second response time can be achieved with the right cluster configuration.
Benchmarks data sources
Older comments
New Bitmap Indexing white paper shows how OpenLink Virtuoso handles loading the 1 billion triple LUBM benchmark set with a sustained rate of 12692 triples/s and the 47M triple Wikipedia data set at a rate of 20800 triples/s. Kingsley Idehen, OpenLink Software.
"The single query stream rate with 100K triples is 14 qps at 100K triples and 11 qps at 1G triples" -- LUBM and Virtuoso
Open source.
http://virtuoso.openlinksw.com/wiki/main/
Jena TDB (1.7B)
TDB is a persistent graph storage layer for Jena. TDB works with the Jena SPARQL query engine (ARQ) to provide complete SPARQL together with a number of extensions (e.g. property functions, aggregates, arbitrary length property paths). It is pure Java, employing memory mapped I/O, a custom implementation of B+Trees and optimized range filters for XSD value spaces (integers, decimals, dates, dateTime).
TDB has been used to load UniProt v13.4 (1.7B triples, 1.5B unique) on a single machine with 64 bit hardware (36 hours, 12k triples/s).
TDB 0.5 Results for the Berlin SPARQL Benchmark (August 2008).
Open source. License: BSD
AllegroGraph (1B)
"This is with version 1.2.4. Our performance numbers are published on our website." -- Steve Sears
Commercial, limited free version.
http://agraph.franz.com/allegrograph/
Jena SDB (650M)
SDB is a new SPARQL database for graphs/named graphs for Jena. It can load UniProt (650M triples) and runs on PostgreSQL, MySQL, Oracle or MS SQL Server, as well as HSQLDB and Apache Derby.
Open source
http://jena.sourceforge.net/SDB/
Mulgara (500M)
"The Mulgara triple store is scalable up to 0.5billion triples (with 64-bit Java)" -- Norman Gray
Open source
RDF gateway (262M)
"the UniProt protein database (262 million triples) and RDF Gateway." -- Geoff Chappell, Intellidimension
Commercial
http://www.intellidimension.com/
Jena with PostgreSQL (200M)
"Our store is pretty big -- its about 200M triples.
We're currently using Jena on Postgres. For our needs this worked out better than Jena/MySQL, Sesame, and Kowari." -- Leigh Dodds, Ingenta
Open source
Kowari (160M)
"My own testing has been in the 10-20M triple range." -- Chris Wilper
Addendum from Chris on Nov 7th, 2005: Since this was written, we have successfully loaded over 160M triples into Kowari on a 64-bit machine with 6GB physical memory. A 64-bit machine is really required to bring Kowari up to this level because it uses mapped files and needs a lot of address space. In our experience in this environment, simple queries still perform fairly well (a few seconds) and complex queries involving 8-10 triple patterns perform worse (a few minutes to an hour).
Open source, unmaintained (See Mulgara fork).
3store with MySQL 3 (100M)
"The store my consortium produces (3store) is used successfully up to 100M triples or so. Beyond that it gets a bit sketchy. I'm currently looking at ways to make it scale to 10^9+ without specialising the store to a particular schema."
More specifically, one user is running it with 120M triples in MySQL 4.1. At that size query works fine, but assertion time is down to about 300 triples/second, which makes growing it any bigger too painful. I should note that 3store is an RDFS store, in version 3 it's possible to disable the inference, which should make it scale to much larger sizes, but there are plenty of other stores that can run vanilla RDF storage well. -- Steve Harris, AKT
Open source (GNU GPL).
http://threestore.sourceforge.net/
Sesame (70M)
(10-20 million triples) " is a lot, but most serious triple stores can handle this I'd say. Sesame certainly can, ..." -- Jeen Broekstra, Aduna
Addendum from Jeen on Feb 10 2006: The above comment should be taken as a minimum of what the store can handle. We recently ran a few scalability tests on Sesame's Native Store (Sesame 2.0-alpha-3). Using the Lehigh University Benchmark we successfully added a LUBM-500 dataset (consisting of about 70 million RDF triples). The machine used was a 2.8 GHz P4 (32-bit) with 1GB RAM, running Suse Linux 10.0 (kernel 2.6), Sun J2SE 1.5.0_06. Upload took about 3 hours. Query performance on the LUBM test-queries was adequate to good: unoptimized, the worst query (Q2) took 1.3 hours to complete, but most queries completed within tens of milliseconds (Q4,5,6,7,8,10,12,13) or 1-5 minutes (Q1,3,9,11,14) - though some of these queries are just fast because they return no results (the native store does no RDFS/OWL inferencing). We have yet to explore larger datasets and performance using RDFS inferencing but it seems that 70M is not the ceiling and that Sesame can easily cope with even larger sets, especially when we use bigger hardware. But that's prediction not fact so I'll leave it at that for now.
Open source.
Others who claim to go big
Claims without signatures or quotes. Please move them from this section when they can be linked to a signed specific capacity measurement.
http://wiki.apache.org/hadoop/HRDF -- Edward Yoon
- Oracle 10g - if one does it big, Oracle would, or? --
Questions
I know storing 200M triples is cool. But which store can handle simultaneous queries from about 10,000 users using RDFS inferencing? -- Anonymous
200M-300M or so seems to be about the max that anybody has reported. It would be very helpful if people could state whether they tried to scale further, and if not able to, what the problems were -- i.e., does it become too slow to add, perform trivial queries, perform complex queries, all of the above, etc. Additionally, it would be extremely helpful if hardware specs were included. Anyway, this is a great resource. -- CS
It would be nice if the postings here commented on the level of inference supported. Loading with forward-chaining and materialization is *much* heavier than just loading the data. The more general question is what part of the semantics of the loaded ontology/dataset is supported by the system. There are subtle differences in what "loading" means for the four systems with the highest results above. RDF Gateway supports the semantics of UNIPROT through backward-chaining. OWLIM supports the semantics of LUBM through forward-chaining. The sort of reasoning required in the UNIPROT load of RDF Gateway is much more complex than that necessary for passing LUBM. Finally, Virtuoso and AllegroGraph are fairly undetermined with respect to the reasoning involved in the experiments they report on. For instance, Virtuoso reports results on LUBM but says nothing about the completeness of the query evaluation. -- Atanas Kiryakov
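To make the explicit/implicit distinction in the comment above concrete, here is a minimal sketch of forward-chaining materialization over two RDFS-style rules (subclass transitivity and type propagation). It is a naive fixpoint loop written for illustration and does not reflect how OWLIM or any other store listed here implements inference. The function name `materialize_subclass` and the `ex:` terms are made up for the example.

```python
SUB, TYPE = "rdfs:subClassOf", "rdf:type"


def materialize_subclass(triples):
    """Forward-chain two RDFS-style rules to a fixpoint (naive, illustrative)."""
    closed = set(triples)
    while True:
        new = set()
        subclass_pairs = [(s, o) for s, p, o in closed if p == SUB]
        for a, b in subclass_pairs:
            for s, p, o in closed:
                # (a subClassOf b), (b subClassOf c) => (a subClassOf c)
                if p == SUB and s == b:
                    new.add((a, SUB, o))
                # (x type a), (a subClassOf b) => (x type b)
                if p == TYPE and o == a:
                    new.add((s, TYPE, b))
        new -= closed
        if not new:
            return closed  # nothing left to derive
        closed |= new


# Hypothetical explicit statements, loosely in the style of a university ontology.
explicit = {
    ("ex:GraduateStudent", SUB, "ex:Student"),
    ("ex:Student", SUB, "ex:Person"),
    ("ex:alice", TYPE, "ex:GraduateStudent"),
}

total = materialize_subclass(explicit)
implicit = total - explicit
print(len(explicit), len(implicit), len(total))  # 3 3 6
```

Here three explicit statements grow to six stored statements, so a load "with inference" writes and indexes more than a plain load of the same file. A backward-chaining store would instead keep only the three explicit statements and derive the others at query time, which is why load figures from the two approaches are not directly comparable.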
Related
RdfStoreBenchmarking (page collecting benchmark results)