SWEO Community Project: Linking Open Data on the Semantic Web
Datasets
This page collects RDF datasets that are part of the Semantic Web.
To be part of the Semantic Web, data has to be accessible as RDF over HTTP through at least one of the access methods listed below. The more methods the better (but avoid aliases). See also the tutorial on How to publish Linked Data on the Web.
The page is part of the SWEO Interest Group Community Projects effort.
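As an illustration of what "accessible as RDF over HTTP" means in practice, the following minimal Python sketch builds a request that asks a server for RDF rather than HTML. The function name, the Accept header value and the example URI are assumptions chosen for illustration, not something this page prescribes.

```python
from urllib.request import Request

# Illustrative Accept header: prefer RDF/XML, accept N3 as a fallback.
RDF_ACCEPT = "application/rdf+xml, text/rdf+n3;q=0.8, */*;q=0.1"


def build_rdf_request(uri: str) -> Request:
    """Build (but do not send) a GET request that asks for RDF."""
    return Request(uri, headers={"Accept": RDF_ACCEPT})


# To actually fetch the data, pass the request to urllib.request.urlopen().
example = build_rdf_request("http://dbpedia.org/resource/Berlin")
```

A server that supports content negotiation would be expected to answer such a request with an RDF document, either directly or after a redirect.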
Datasets available with dereferenceable URIs - LinkedData
Example things are starting points for use of RDF browsers.
Semantic Web Community Wiki Public Semantic MediaWiki featuring Linked Data views and a SPARQL endpoint.
Craigslist as Linked Data. See for details.
U.S. Securities and Exchange Commission's EDGAR database available as Linked Data and via a SPARQL endpoint.
YAGO ontology available as Linked Data. The ontology should be interlinked with DBpedia shortly.
ISWC and ASWC 2007 Conference Data The data set contains data about tracks, papers, sessions, talks, workshops, tutorials, invited talks, panels, organisers, people, organisations and topics. The data is available as Linked Data, SPARQL endpoint and as RDF dumps.
RKB Explorer Data 25 different domains, each with a separate dataset. The data sets are focused on scientific research, and the larger ones include DBLP, Citeseer, CORDIS, NSF, EPSRC, RAE2001 as sources. The data is available as Linked Data, SPARQL endpoint and RDF dumps, and a simple browser is provided. Semantic Web Sitemaps provided.
Musicbrainz provides lots of data about artists and their albums. Served as Linked Data and via a SPARQL endpoint.
lingvoj.org provides URIs and multilingual labels for hundreds of human languages. Example entries: French language, Chinese language.
Wikicompany is a free, worldwide business directory that anyone can edit. OpenLink Software hosts a Linked Data version of the directory, extracted by the DBpedia team using DBpedia's software. Example entries: Northwest Airlines, Apple Computer, OpenLink Software.
Christian Becker's flickr wrappr pulls photos related to DBpedia resources from flickr and serves them as RDF. Example: Paris
Joshua Tauberer's GovTrack.us publishes linked data about members of the U.S. Congress, as well as bills, committees and votes. 12M triples. Example resources, announcement
US Census RDF version of the 2000 US census dataset. Consists of around 1 billion triples. Served as linked data and via a SPARQL endpoint. Example things: USA New Jersey
WordNet is a large lexical database of English. Currently being RDFized by a W3C Best Practices Task Force. Details ... Example thing: the verb "read" in the first sense
DBpedia: Linked Data version of Wikipedia. The DBpedia dataset currently provides information about more than 1.95 million “things”, including at least 80,000 persons, 70,000 places, 35,000 music albums, 12,000 films. Provides descriptions in 12 different languages. Altogether, the DBpedia dataset consists of 103 million RDF triples. The dataset is interlinked with various other data sources. Example things: Paul McCartney, Berlin, Tetris
Open Cyc Semantic Web version of the Open Cyc ontology. Supports content negotiation on concept URIs. Example things: RetailStore, Dog. Concept Browser
DBLP Bibliography Server Berlin: Provides bibliographic information about scientific papers. Size of the dataset: 800,000 articles and 400,000 authors, approx. 15 million triples. Example thing: Tim Berners-Lee in the bibliography. The server provides the November 2006 version of the DBLP dataset. As the Hannover DBLP Bibliography server is updated weekly, you should set RDF links to the Hannover server and not the Berlin one.
DBLP Bibliography Server Hannover: Derived from the FU Berlin server, but with more links between the publications (e.g., to conference series) and updated weekly. Unfortunately, there is no backward compatibility with regard to URIs (URIs for persons no longer include numbers). Example thing: Tim Berners-Lee
RDF Book Mashup: Provides bibliographic information, reviews and sales offers for most books that have an ISBN. Maps data from Amazon and Google Base to RDF. Size of the dataset: Unknown, billions of triples. Example thing: "Weaving the Web", the book
Project Gutenberg Catalog Linked data version of and SPARQL endpoint over the Project Gutenberg catalog. Interlinked with DBpedia. Example author: Ed Krol
Gene Ontology Annotations Chris Mungall (Berkeley Drosophila Genome Project) serves 6 million annotations from the Gene Ontology database.
Gene Fruitfly Embryogenesis Images Chris Mungall (Berkeley Drosophila Genome Project) serves a database containing annotated images of gene expression in fruitfly embryogenesis.
IS-Group@Freie Universität Berlin There is RDF data about the activities and members of the IS-Group at Freie Universität Berlin available. Example thing: DOAP description of D2R Server project
ECS School Southampton Serves data about members, projects and seminars on the Web as Linked Data. Example person: Marcus Cobden
MindSwap There is RDF data about the activities and members of the Mindswap group at Maryland available.
Revyu has reviews and ratings in RDF/XML available via dereferenceable URIs and a SPARQL endpoint. FOAF and Tag information is also available by the same mechanism.
ESWC2006 Conference Dataset describes many aspects of ESWC2006 according to the ESWC2006 Conference Ontology, covering authors, papers, sessions and workshops. Mostly available via dereferenceable URIs. The data might need checking over and the number of triples is not huge, but it is well complemented by similar data sets from ISWC2006.
ESWC2007 Conference Dataset describing authors, papers, sessions and workshops. Available as Linked Data, HTML and via a SPARQL endpoint.
Geonames Information about over 6 million places and geographic features. Example thing: Berlin
Several community sites with FOAF-enabled profiles (see the table at the FOAF wiki)
UniProt provides a large life sciences data set with 300M+ triples (contact Eric Jain for a login)
OpenGuides are a network of wiki-based city guides. Example: Open Guide to Milton Keynes. Each node has RDF/XML describing the thing the node is about, in addition to wiki versioning information. URIs might need tidying up, and don't currently support 303 redirects.
Advogato is exporting its users' profiles using FOAF.
Robots.net is exporting its users' profiles using FOAF.
TalkDigger is exporting its users' profiles using FOAF and the conversations data using SIOC (note: some problems between the SIOC users and the FOAF profiles still need to be resolved).
Locationary provides geographic information from different information sources. Still prototypical.
dbtune provides linked data access for the Jamendo Creative Commons music platform, the Magnatune label, the BBC John Peel sessions, the MySpace data and the AudioScrobbler data. It also hosts a version of Musicbrainz powered by D2R, and interlinked with Lingvoj and DBpedia.
SemanticWebCentral is a software development site for Open Source Semantic Web tools (think SourceForge for the Semantic Web). It publishes information about its projects and developers in RDF, using the GForge ontology.
Semantic Web School - Vienna: The Semantic Web School provides the latest information on Semantic Web issues in the form of its D2R-mapped press collection with glossary, wikilinks and so forth, using D2R Server and RSS features.
Jamendo Music server exposing artists, albums, tracks, covers, lyrics, tags and P2P links (BitTorrent, ed2k)
CIA Factbook D2R Server publishing the CIA Factbook. Example thing: Botswana
Bio2RDF Semantic web atlas of postgenomic knowledge about human and mouse.
Eurostat Countries and Regions D2R Server publishing statistical information about European countries and regions. Example thing: Leipzig. See also LOD Eurostat page and the alpha release from this project.
News about the Semantic Web provided by the Semantic Web School Austria.
doapspace.org 43,000 DOAP profiles of Freshmeat projects, 15,000 SourceForge projects, 1,720 Python Package Index projects and hundreds of spidered DOAP.
Open Archives Demo showing how an OAI-PMH endpoint is exposed as Linked Data with the OAI2LOD server.
BBC Later and Top of the Pops Data about episodes and tracklists. Interlinked with MusicBrainz and DBpedia.
MySpace wrapper This service provides a live RDF representation of Myspace users. If the user is also an artist, then the corresponding tracks in the streaming audio cache are included in the RDF.
LastFM wrapper This service provides a live RDF representation of your last 10 tracks submitted to AudioScrobbler/Last.fm
overdogg.com Allows users to post needs and wants and expose them to the Semantic Web, and provides matchmaking with qualified providers. Scrapes craigslist want ads for FOAF and TIWAN metadata (currently > 100K docs and users). Ads are exposed as Linked Data (RDF).
See also http://esw.w3.org/topic/AnRdfHarvesterStartingPoint
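Several entries above mention content negotiation and 303 redirects. The sketch below shows one way to interpret the response a Linked Data URI returns when it is dereferenced. It is a simplified illustration under assumptions: the function name, the result labels and the set of media types are hypothetical, and real servers may behave differently.

```python
# Media types treated as RDF in this sketch (an illustrative, incomplete set).
RDF_TYPES = {"application/rdf+xml", "text/rdf+n3", "text/n3", "text/turtle"}


def classify_response(status: int, content_type: str = "", location: str = "") -> str:
    """Describe what an HTTP response says about a dereferenced URI."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if status == 303 and location:
        # 303 See Other: the URI names a thing; a description lives elsewhere.
        return f"redirects to a description at {location}"
    if status == 200 and media_type in RDF_TYPES:
        return "RDF document"
    if status == 200:
        return "non-RDF document"
    return "not dereferenceable"
```

For instance, `classify_response(200, "text/html")` reports a non-RDF document, which suggests retrying the request with an RDF Accept header.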
Datasets available as RDF Dumps
QB's Quotes RDF contains at least 42,000 famous quotations with author and subject, from Quotations Book
SIMILE Data Collection containing various datasets including CIA's World Factbook, Library of Congress' Thesaurus of Graphic Materials, National Cancer Institute's cancer thesaurus, Web Consortium's Technical Reports.
DBpedia: Dataset containing extracted data from Wikipedia. About 1.6 million concepts described by 91 million triples, including abstracts in 10 different languages.
GovTrack.us RDF data about the U.S. Congress
U.S. Census data comprises population statistics at various geographic levels, from the U.S. as a whole down through states, counties and sub-counties (roughly, cities and incorporated towns). More than 700 million triples.
UniProt provides a large life sciences data set with 300M+ triples
SwetoDblp ontology focused on bibliography data of publications from DBLP with additions that include affiliations, universities, and publishers
Wikipedia³: 47 million triples containing extracted metadata from Wikipedia.
Chef Moz: 290344 restaurants - 104856 reviews - 59243 links to reviews - 2402 editors available as RDF under a free license.
DOAP Store provides daily generated dumps with all its DOAP project descriptions. RDF/XML, N3
Rpm Find - This is freely downloadable from http://rpmfind.net/linux/rpm2html/mirror.html. The RDF data expands to about 1.3GB - not sure what that equates to in numbers of triples.
Open Directory - this is the classic RDF source but historically has had some problems with RDF correctness. http://rdf.dmoz.org/
Music Brainz - this service dumps its data as RDF fairly frequently at ftp://ftp.musicbrainz.org/pub/musicbrainz/data/. Currently the zipped version of this data is 102MB
Bitzi - a collaborative file describing service. Dumps data as RDF here: http://bitzi.com/openbits/datadump. The data consists of 330,026 discrete files, 270MB uncompressed.
Texai Lexicon - This is a machine readable dictionary derived from WordNet 2.1, Wiktionary, the CMU Pronouncing Dictionary and the OpenCyc lexicon. Each lexicon word sense entry contains links back to the source dictionary entry, and also to OpenCyc if the entry has been mapped to the Cyc ontology.
doapspace.org/export - All 55,000+ DOAP profiles available as RDF/XML DOAP. Do what you may with it. This includes all DOAP created by doapspace and all DOAP spidered. XML/RDF tarball
Lots of others. Please feel free to add plenty
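For dumps whose size is only known in bytes, such as the Rpm Find entry above, a triple count can be estimated once the dump is available in N-Triples, where each statement occupies one line. The sketch below assumes the dump has already been converted to N-Triples; the function name and the file name in the usage note are hypothetical.

```python
from typing import Iterable


def count_ntriples(lines: Iterable[str]) -> int:
    """Count statements in N-Triples input, skipping blank and comment lines."""
    count = 0
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count
```

It can be applied to an open file without loading it into memory, for example `with open("dump.nt", encoding="utf-8") as f: print(count_ntriples(f))`.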
Datasets available via SPARQL Endpoints
See Collection of SPARQL Endpoints
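A SPARQL endpoint can typically be queried with a plain HTTP GET request that carries the query in a `query` parameter. The following sketch only builds such a URL; the function name and the endpoint address are placeholders, not real services listed on this page.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the GET URL for sending a SPARQL query to an endpoint."""
    # Append to an existing query string if the endpoint URL already has one.
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": query})


# Hypothetical endpoint, used only to show the resulting URL shape.
url = sparql_get_url("http://example.org/sparql", "ASK { ?s ?p ?o }")
```

The resulting URL can be opened in a browser or fetched with any HTTP client.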
Datasets you can RDFize yourself
If you have some data that needs to be RDFized, and wonder how, look also here:
- RDFImportersAndAdapters lists software projects that convert data to RDF
Datasets currently being RDFized
MusicBrainz. Please ask Frederick Giasson for details.
GEMET. GEMET is the GEneral Multilingual Environmental Thesaurus of the European Environment Agency. Please ask Bernard Vatant for details.
Craigslist. See overdogg.com and TIWAN. Also planning for Myspace and Facebook want ads. Contact Sherman Monroe for details.
Datasets that would be nice to have on the Web of Data
Lots. Please feel free to add plenty
MetaWeb Freebase Metaweb has started to publish data dumps of the complete Freebase dataset under a Creative Commons Attribution (CC-BY) license. The dumps are in a triple format (not RDF) and Metaweb will update the dumps every three months from now on. It would be really exciting to turn these dumps into RDF, publish them on the Web as Linked Data and interlink them with data sets from the LOD cloud. For instance, interlinking them with DBpedia should be very easy as both datasets contain Wikipedia article identifiers. If somebody is interested in doing this, please contact ChrisBizer.
Open Library project that builds an open, digital library that is supposed to contain all books that have been published. Simple data model so wrapping it should be easy. See also Frederick's post on the open library and the BIBO ontology
U.S. Census Tiger/Line data on roads, zip code geography, places, etc. See also LOD Eurostat page (there is some overlap with Geonames)
IMDB Data. Not sure of the licensing terms. Source. Can be converted to MySQL using JMDB. Source
Open University Course Units. See LabSpace for an idea of what is available, currently in OU-specific XML wrapped in a zip file
GCIDE_XML. The GNU version of The Collaborative International Dictionary of English (Webster's). Available now as XML. Source
Internet Archive. Provides multiple interesting datasets.
Library of Congress Catalog. Provides information about books and millions of other digital assets.
FreeDB. A database to look up CD information using the internet. Source
Peter Skomoroch: Some Datasets Available on the Web
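The Freebase entry above suggests interlinking with DBpedia via shared Wikipedia article identifiers. A minimal sketch of that idea follows: given pairs of a source URI and a Wikipedia article title, it emits owl:sameAs links as N-Triples. The source URIs, the function names and the simplified title encoding are assumptions; the actual Freebase dump format is not described here.

```python
from typing import Iterable, Iterator, Tuple
from urllib.parse import quote

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def dbpedia_uri(wikipedia_title: str) -> str:
    """Map a Wikipedia article title to a DBpedia resource URI (simplified)."""
    # DBpedia resource names use the article title with underscores for spaces.
    name = wikipedia_title.strip().replace(" ", "_")
    return DBPEDIA_PREFIX + quote(name, safe="_(),")


def same_as_triples(links: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield one owl:sameAs N-Triples line per (source URI, title) pair."""
    for source_uri, title in links:
        yield f"<{source_uri}> <{OWL_SAME_AS}> <{dbpedia_uri(title)}> ."
```

Feeding it a pair such as a hypothetical `http://example.org/freebase/paul_mccartney` and the title `Paul McCartney` yields a single link to the corresponding DBpedia resource.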
Papers and Web Resources on serving Data on the Semantic Web
Tim Berners-Lee: Linked Data
Tim Berners-Lee: Browsable Data
Alistair Miles et al.: Best Practice Recipes for Publishing RDF Vocabularies
Christian Bizer, Richard Cyganiak, Tom Heath: How to publish Linked Data on the Web (Tutorial)
Ding, Finin: Characterizing the Semantic Web on the Web
Frederick Giasson: Distribution of semantic web data
Richard Cyganiak: Debugging Semantic Web sites with cURL
Frederick Giasson: RDF dump vs. dereferencable URIs
Henry Story: I have a web 2.0 name ! together with Foaf enabling an enterprise and a discussion of the posts by Richard Cyganiak.
ESW Wiki: DereferenceURI
ESW Wiki: SparqlEndpointDescription
Francois Belleau: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System
Frederick Giasson: Content negotiation: bad use cases I recently observed
No longer available
Roller Blog Entries: There was a D2R Server running at http://roller.blogdns.net:2020/ which exported blog posts from a Roller Blog Server using the AtomOWL vocabulary. See SPARQLing Roller for details. The D2RQ mapping file should still be useful.