Semantic Web Information Portal

The Semantic Web Information Portal is the result of the Information Gathering Task. It will outlast the SWEO interest group and be managed by editors and contributors from both inside and outside W3C. Access to the data is free to the web, and all RDF sources are open.

The portal has a start page that gives access to the most important, hand-edited lists (see below). Search is possible using both full text and an ontology. Tools and information items are related to each other, and the portal allows browsing through the collected items. Lists are generated automatically based on RDF data.

The rationale for creating this portal in the described way is to guarantee that the portal will continually grow and be kept up-to-date. It has to be a social website allowing comments and ratings to attract users and prolong the duration of SWEO. LeoSauermann is aware that this is a lot of work, but we also have a lot of hands and months of time left.

Portal Layout

/Designs - Possible portal layout and design

Lists on the portal

Gathering as many resources as possible will lead to information overload, and although the lists can now be created automatically, they may contain too much, outdated, or wrong information. A few important lists will help readers find the key information. These lists have to be edited and compiled by hand, as the existing HTML lists were. Some users will be able to edit these lists, either using the SWEO portal or by keeping the list on their homepage using RDF.

Such hand-compiled lists will include

links to specifications
tutorials for beginners
popular RDF APIs
best success stories / use cases
upcoming events

Note that all these lists are based on a hand-made selection of information items; the classes "specification", "tutorial", etc. exist in the ontology.

Automated lists can be created based on ratings by the users or the ontology:

recent changes, recently added
most popular tools (by clicks, by rating)
RSS feeds for everything
all specifications | tutorials | books | success stories | etc.
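
The automated lists above could be derived from item records along the following lines. This is only a sketch: the `Item` fields, the class names, and the three helper functions are illustrative assumptions, not an agreed portal schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    """Hypothetical information item; all field names are assumptions."""
    title: str
    item_class: str      # e.g. "tool", "tutorial", "specification"
    added: date
    clicks: int = 0
    avg_rating: float = 0.0


def recently_added(items, limit=10):
    """Newest items first."""
    return sorted(items, key=lambda i: i.added, reverse=True)[:limit]


def most_popular(items, item_class, by="clicks", limit=10):
    """Items of one class, ordered by clicks or by average rating."""
    if by not in ("clicks", "avg_rating"):
        raise ValueError("by must be 'clicks' or 'avg_rating'")
    of_class = [i for i in items if i.item_class == item_class]
    return sorted(of_class, key=lambda i: getattr(i, by), reverse=True)[:limit]


def all_of_class(items, item_class):
    """Complete alphabetical list for one class."""
    return sorted((i for i in items if i.item_class == item_class),
                  key=lambda i: i.title.lower())


# Made-up sample data for illustration only.
ITEMS = [
    Item("Example Tool A", "tool", date(2007, 2, 1), clicks=120, avg_rating=3.5),
    Item("Example Tool B", "tool", date(2007, 3, 1), clicks=40, avg_rating=4.5),
    Item("Example Tutorial", "tutorial", date(2007, 2, 15), clicks=300, avg_rating=4.0),
]
```

The same helpers could back the RSS feeds, since a feed is just one of these lists serialized in a different format.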

To best facilitate our mission of educating and reaching out to different audiences about the Semantic Web, it is important that the portal accomplish the following goals:

Identify the best SW resources, as determined by portal editors, SWEO, or a community
Associate the resources with appropriate audience and industry (domain) facets, to allow people to drill down quickly to material most appropriate for their technical level, domain familiarity, and goals
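
A drill-down over audience and domain facets might look like the sketch below. The `Resource` record and the facet values ("beginner", "life sciences", and so on) are hypothetical; the real facets would come from the classification ontology.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical resource record with two facet sets (assumed names)."""
    title: str
    audiences: set = field(default_factory=set)
    domains: set = field(default_factory=set)


def drill_down(resources, audience=None, domain=None):
    """Keep resources matching every facet that was actually given."""
    result = []
    for r in resources:
        if audience is not None and audience not in r.audiences:
            continue
        if domain is not None and domain not in r.domains:
            continue
        result.append(r)
    return result


# Made-up sample data for illustration only.
RESOURCES = [
    Resource("Intro tutorial", {"beginner"}, {"general"}),
    Resource("Case study", {"manager", "beginner"}, {"life sciences"}),
    Resource("API reference", {"developer"}, {"general"}),
]
```

For example, `drill_down(RESOURCES, audience="beginner", domain="general")` narrows the three sample entries to the intro tutorial.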

Danny Ayers wrote more on these aspects of information gathering. (See also the surrounding thread.)

A Social Portal

The existing effort invested in lists should be continued. People who already managed a list outside of SWEO will be able to continue to do so with the Semantic Web Information Portal. Continuous editing and management of the lists is needed, even when SWEO stops or when people end their involvement. So it is important to set up accounts and editor roles.

Named users and visitors of the portal can rank information items by giving stars. Also, the view count of information items is measured. Based on this information, popular items are ranked higher in searches and in result lists.
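
One possible way to combine star ratings and view counts into a single ranking is sketched below. The formula (a damped star average plus a log-scaled view count) and all its weights are assumptions made for illustration; nothing here has been decided for the portal.

```python
import math


def bayesian_average(star_sum, vote_count, prior_mean=3.0, prior_weight=5):
    """Average star rating, pulled toward prior_mean when votes are few."""
    return (prior_mean * prior_weight + star_sum) / (prior_weight + vote_count)


def popularity_score(star_sum, vote_count, views, view_weight=0.5):
    """Damped star average plus a log-scaled contribution from views."""
    return bayesian_average(star_sum, vote_count) + view_weight * math.log10(1 + views)


def rank(items):
    """items: iterable of (name, star_sum, vote_count, views); best first."""
    return sorted(items,
                  key=lambda it: popularity_score(it[1], it[2], it[3]),
                  reverse=True)


# Made-up sample data: one 5-star vote vs. forty votes averaging 4.5 stars.
SAMPLE = [
    ("A", 5, 1, 10),
    ("B", 180, 40, 2000),
    ("C", 0, 0, 0),
]
```

With these weights, a single enthusiastic vote does not outrank an item with many good ratings, which is the main reason for damping the average.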

People and visitors should be able to add comments to information items. For this, an account at the portal is needed.

Authors of information items are able to upload descriptions of their information items, using standards like DOAP. Alternatively, they can register at the portal and enter the metadata about their information item by hand. DOAP or DC descriptions can be auto-generated by the portal, allowing a Web 2.0 way to reuse the data.
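
As a rough illustration of auto-generating a DOAP description, the sketch below builds a minimal RDF/XML document with the Python standard library. It uses only `doap:name`, `doap:homepage`, and `doap:shortdesc`; the function name and the example values are assumptions, and a real export would likely carry more properties.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DOAP_NS = "http://usefulinc.com/ns/doap#"


def doap_description(name, homepage, shortdesc):
    """Return a minimal DOAP project description as an RDF/XML string."""
    # Register prefixes so the output uses rdf: and doap: instead of ns0:, ns1:.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("doap", DOAP_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    project = ET.SubElement(root, f"{{{DOAP_NS}}}Project")
    ET.SubElement(project, f"{{{DOAP_NS}}}name").text = name
    # The homepage is a resource link, not a literal.
    ET.SubElement(project, f"{{{DOAP_NS}}}homepage",
                  {f"{{{RDF_NS}}}resource": homepage})
    ET.SubElement(project, f"{{{DOAP_NS}}}shortdesc").text = shortdesc
    return ET.tostring(root, encoding="unicode")


# Hypothetical item used only for illustration.
EXAMPLE = doap_description("Example Tool", "http://example.org/tool",
                           "A made-up tool used as an example.")
```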

Management of Information Items

Items are entered either into the portal, using a web interface, or read from RDF descriptions managed by others. Web interfaces on the portal allow

entering information items
editing information items
rating
creating and managing lists for special causes.
managing the ontology used to organise the portal (classes like 'tool', 'tutorial', also topics, etc).

For importing data from RDF sources, we propose an RDF vocabulary (see below) for describing lists of resources. We will inform the authors of existing lists about this format and suggest that they publish their lists using it. Once an external author has registered the URL of their RDF file with the SWEO data importer, the SWEO portal will read it regularly (for example, daily) and update the SWEO database.
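
The register-then-reread cycle could be sketched as follows. The `Importer` class, its method names, and the in-memory `store` are hypothetical; the fetch function is passed in, so the sketch runs without network access.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Feed:
    """A registered RDF source (hypothetical record)."""
    url: str
    last_fetched: Optional[datetime] = None


class Importer:
    """Re-reads registered URLs once per interval (daily by default)."""

    def __init__(self, fetch, interval=timedelta(days=1)):
        self.fetch = fetch          # callable: url -> document text
        self.interval = interval
        self.feeds = {}             # url -> Feed
        self.store = {}             # url -> last fetched text (stand-in for the db)

    def register(self, url):
        """Register a URL; registering it twice has no further effect."""
        self.feeds.setdefault(url, Feed(url))

    def due(self, now):
        """Feeds never fetched, or last fetched at least one interval ago."""
        return [f for f in self.feeds.values()
                if f.last_fetched is None or now - f.last_fetched >= self.interval]

    def run_once(self, now):
        """Fetch every due feed and return the URLs that were updated."""
        updated = []
        for feed in self.due(now):
            try:
                body = self.fetch(feed.url)
            except OSError:
                continue            # leave last_fetched unchanged; retry next run
            self.store[feed.url] = body
            feed.last_fetched = now
            updated.append(feed.url)
        return updated
```

In a deployment, `run_once` would be called from a scheduled job, and `fetch` would wrap an HTTP client; both are left open here.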

SWEO will provide a tutorial and a web tool that help others create RDF files that are compatible with the portal. As with Foaf-A-Matic, people can create the RDF files themselves, upload them to their website, and trigger the portal to crawl them. The same is needed for lists of information resources: people can create lists of their favorite RDF tools and publish them anywhere on the web, to be crawled later by the portal. By reusing data managed on other websites, we hope to motivate external people to keep their lists up-to-date.

We may start the data import by supporting only plain RDF/XML files as input, adding more complex input formats (RDFa, GRDDL, etc.) when needed.
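
Starting with one input format and adding others later suggests a small parser registry keyed by media type, sketched below. The registry and function names are assumptions. The RDF/XML "parser" here is deliberately simplified: it only collects `rdf:about` URIs of top-level nodes, whereas a real importer should use a full RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# media type -> parser function; only RDF/XML is registered to begin with.
PARSERS = {}


def parser_for(media_type):
    """Decorator that registers a parser for one input format."""
    def decorator(func):
        PARSERS[media_type] = func
        return func
    return decorator


@parser_for("application/rdf+xml")
def parse_rdfxml(text):
    """Simplified stand-in: list rdf:about URIs of top-level resources."""
    root = ET.fromstring(text)
    about = f"{{{RDF_NS}}}about"
    return [child.get(about) for child in root if child.get(about)]


def parse(text, media_type):
    """Dispatch to the registered parser, or reject unsupported formats."""
    try:
        parser = PARSERS[media_type]
    except KeyError:
        raise ValueError(f"unsupported input format: {media_type}") from None
    return parser(text)


# Made-up input for illustration only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.org/tool-a"/>
  <rdf:Description rdf:about="http://example.org/tool-b"/>
</rdf:RDF>"""
```

Support for RDFa or GRDDL would then mean registering one more function, without touching the callers of `parse`.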

Copyright of Information Items

Who has the copyright on the managed information descriptions? It shouldn't be a big problem, as the texts are rather short (a summary of a website) and the lists are not particularly creative or innovative work. It may be enough to assume one copyright for all content. It is important that the data gathered on the portal is reusable in other contexts (by Semantic Web agents). A suitable license would be the Creative Commons Attribution license (allowing commercial use and derivative works).

Alternative suggestions to handle copyright:

Items aggregated into the SWEO Information Portal can be licensed under a Creative Commons license, with each author choosing a license. (This may be a little complicated to implement.)
All data is considered to be under one license. People who register URLs for automatic import have to check a box saying "the feed conforms to the license".

People suggested for editing information on the portal

in randomized order

PasqualePopolizio
Dave Beckett - could be asked to manage the list of tools and keep the list of most popular ones
Jeff Pan - could be asked to maintain some data
LeoSauermann
IvanHerman
LeeFeigenbaum
Chris Bizer and Daniel Westphal manage a very detailed list of tools

Possible institutions to create and maintain the portal

1. W3C. In my opinion, option 1 isn't ideal because the W3C isn't technology-independent enough in this space (ironic?!). Running the final version on a W3C site is subject to internal discussion at W3C, primarily on the issue of long-term management, manpower, etc. Unfortunately, W3C's finances are thin at the moment. :-( The technology used to implement the portal should be open source and/or public domain for this option, but that should not be a major problem.

2. University members of W3C. I'm talking about perception, rather than fact. Option 2 isn't ideal because the process for change may be too laborious.

3. Industry members of W3C or other companies.

Paul Walsh from Segala.com offers to invest a designer and some manpower for this.
- http://lists.w3.org/Archives/Public/public-sweo-ig/2007Feb/0114.html
Kingsley, on behalf of OpenLink Software, offered an initial host machine and a Quad Store that will provide a SPARQL endpoint and an RDF archive access point.
Benjamin Nowack already runs http://rdfer.com/; perhaps he could join, collaborate, or host.

4. Some sort of loose and independent management based on community effort. If a suitable host can be found to run the server for a longer period, volunteer/community-based management is also feasible, or at least worth thinking about.

In all cases, a suitable URI for the portal should be found, defined, and registered as soon as possible. It can then be pointed at specific servers later, and we may even move it from one solution to another, but the URI must be stable.

Expected Traffic

What traffic do we expect? Some made-up numbers...

~ 5 people editing per day
~ 100 visitors per day; if it explodes, up to 10,000 visits per day (depends on whether the Semantic Web is a success)
about 5,000 people subscribing to the RSS feeds (everyone from our community). This could be heavy, but easy to cache.
500,000 hits in one day if we get slashdotted (or?)
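
To put these made-up numbers into perspective, they can be converted into average request rates. The hourly RSS polling interval below is an assumption (feed readers vary widely), and averages hide peaks, so treat the results as rough orders of magnitude.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(requests_per_day):
    """Average request rate over a full day."""
    return requests_per_day / SECONDS_PER_DAY


def rss_requests_per_day(subscribers, polls_per_day):
    """Feed requests per day if every subscriber polls at a fixed rate."""
    return subscribers * polls_per_day


# Figures from the list above; 24 polls per day is an assumed polling rate.
slashdot_rate = per_second(500_000)
rss_daily = rss_requests_per_day(5_000, 24)
rss_rate = per_second(rss_daily)
busy_day_rate = per_second(10_000)
```

Under these assumptions the slashdot case averages under six requests per second, and hourly RSS polling adds 120,000 requests per day, more than ten times the traffic of a busy day of visits. That supports the point that the feeds are the heavy but cacheable part.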

Technical Realisation

LeoSauermann proposes to use plain PHP for this website. The rationale for PHP is that it can be deployed everywhere; Ruby, Java, or Python do not allow this. We should also restrict ourselves to MySQL for storage (comment from Kingsley Idehen: why?).

Suggestion: develop the portal as a PHP based open source project on sourceforge.net. This would show how Semantic Web technology is used and it would also allow people like Chris Bizer or others to contribute easily to the code. Also, others can reuse the Portal for their respective community (for example, xml people, gardeners, surfers, whoever needs a semantic web 2.0 information gathering portal).

An existing content management system should be reused. Additional plugins can then be created for adding the "crawled" data from RDF sources.

We need an existing content management system to be adapted, suggestions?

Drupal (DannyAyers)
- LeoSauermann: drupal sounds like a good start!
Plone
Typo3 CMS
ALOE project (under development at DFKI)
- is a learning object repository,
- development: [1]
- required metadata fields are: title, tags, description, format, license-rights,
- problem: we have more types (people, tools, events). Additional metadata are stored as text; to see the metadata of these things, the user has to click.
- chance: rating, multiple users, and personalized selection work well, because this system is built for learning objects and e-learning.
- chance: if we need special metadata, it can be customized
- screenshot available on demand from LeoSauermann
Build the thing from ground up based on an RDF API
- Use OpenLinkSW Virtuoso Server & its template language [2]
- use a conventional MySQL database for ratings and lists, expose it as RDF using Chris Bizer's D2RQ
use Virtuoso RDF Quad Store and the OpenLink Data Spaces (ODS) platform.
- a must-have is a SPARQL endpoint to all data
- Note: Virtuoso is a hybrid data management system for SQL, RDF, XML, and free text. It supports the data access APIs ODBC, JDBC, ADO.NET, OLE DB, and XMLA. It is also a virtual database that can expose ODBC and JDBC data as RDF via RDF views. It also produces RDF from non-RDF web data sources (pages, feeds, web services), including embedded semantics such as RDFa and eRDF.
a data crawler. Suggestions: Virtuoso (as noted above regarding RDF from non-RDF data sources)
A set of ontologies
- /DataVocabulary - a RDF vocabulary for the information items and lists of them.
- /ClassificationOntology - an ontology to classify information items (give them types)
a web interface for the lists. Suggestions are:
- Exhibit by SIMILE
- OpenLink's ODS Platform
- OAT Toolkit (collection of RDF aware Controls and Data Access Frameworks)