SweoIG/TaskForces/CommunityProjects/FOAFWhitelisting

Community Project

  • See also: FoafOpenid -- filtering blog spam using social networks.

Whitelisting Email Senders with FOAF

Kjetil Kjernsmo

1. Please provide a brief description of your proposed project.

I propose to create a system, based on FOAF and a trust metric, that whitelists email coming from trusted senders, and to integrate that system into spam-fighting tools, MTAs and MUAs.

2. Why did you select this particular project?

Spam is a huge problem that affects almost every Internet user, so nearly everyone has an interest in novel approaches to fighting it. Although this proposal, like all spam-fighting techniques, is not a silver bullet, it is a relatively cheap weapon in the arms race. Moreover, a lot of FOAF data is already available (more than 15 million, recent estimates say), social networking sites give users a friendly way to create the needed networks, and many tools are available to do so as well, based on e.g. vCard address books. Finally, since this will create a safer way to identify ham, users can be more aggressive with spam-side techniques without paying for it in more false positives.

3. Why do you think this project will have a wide impact?

As mentioned above, this is an approach to a problem that affects a lot of people, and the data needed to make it useful for a relatively large group of people from day 1 is already on the Web. It is therefore likely to bring more users to FOAF quickly. That in turn will quickly make it even more useful, and so we get a network effect.

I feel that I should mention the Konfidi project, which has many of the same goals as my proposal and aims to strengthen FOAF by using PGP. While we must expect and guard against some spammer attacks, I believe that the reliance on PGP will make it hard for Konfidi to achieve wide-spread adoption.

4. Can your project be easily integrated with other wide-spread systems? If so, which and how?

My first priority is to integrate it with SpamAssassin, a wide-spread and award-winning spam-fighting system. It can be integrated by creating a plugin. The API for doing so is well documented and relatively straightforward. Justin Mason, one of the SpamAssassin founders, has expressed interest in FOAF-based whitelisting, and while the plugin can be distributed on its own, if we prove the case, we might get it into the core SpamAssassin distribution.

It can also be integrated on the MTA level. For example, qpsmtpd, a fully programmable SMTP daemon, also has a plugin architecture, and I will write a plugin for this MTA as well.

Finally, depending on the architecture of the implementation, we might also push for implementation on the MUA level, e.g. Opera's M2, Mozilla's Thunderbird or Microsoft's Outlook.

All these systems should be able to use the whitelisting service, and so it should be possible to get wide-spread support.

5. Why is it that this project should be done right now, i.e. why should people prioritise this ahead of other projects?

The timing is important. For one thing, the SpamAssassin project has a 3.2.0 release coming up shortly, and if we proceed swiftly, we may have a good feature for the new release.

There is also an important development in spam itself that indicates that the timing is right. Increasingly, spammers embed their message in an image attachment, and the message itself contains only text designed to foil Bayesian classification filters. The response to that is to apply Optical Character Recognition to the image attachments, but this is an expensive operation for mailing systems to perform. If we can provide a simple and scalable way to whitelist that includes potential senders beyond the immediate acquaintances, so that messages containing images from known sources do not need to be OCRed, it will save substantial resources and drive adoption, as it solves an immediate problem.
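The OCR saving described above could be sketched roughly as follows. This is only an illustration in Python using the standard email package: the function names, the trust score in the range 0 to 1 and the threshold of 0.5 are all assumptions made for the example, not part of the proposal.

```python
from email.message import EmailMessage


def has_image_attachment(msg):
    """Return True if any MIME part of the message is an image."""
    return any(part.get_content_maintype() == "image" for part in msg.walk())


def should_run_ocr(msg, sender_trust, trust_threshold=0.5):
    """Decide whether the expensive OCR step is needed.

    sender_trust is a hypothetical score between 0.0 and 1.0 obtained
    from the FOAF-based whitelist; trust_threshold is an arbitrary
    cut-off chosen for this sketch.
    """
    if not has_image_attachment(msg):
        return False  # nothing to OCR
    # Images from sufficiently trusted senders skip the OCR step.
    return sender_trust < trust_threshold


# Example: a message with a (fake) image attachment.
demo = EmailMessage()
demo["From"] = "someone@example.org"
demo.set_content("See the attached picture.")
demo.add_attachment(b"not really a PNG", maintype="image", subtype="png")
print(should_run_ocr(demo, sender_trust=0.9))
print(should_run_ocr(demo, sender_trust=0.0))
```

The point of the sketch is only that the whitelist lookup happens before OCR, so that the cheap check can short-circuit the expensive one.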

Finally, many in the Semantic Web Community have stated that they feel this is an urgent undertaking.

6. What can you contribute to the project?

I commit to creating the SpamAssassin and qpsmtpd plugins. I can also help compile the needed data. I will participate in the discussion on how the needed trust metric should be computed. My professional work includes the Opera Community, which creates FOAF data and allows people to grow social networks. In my spare time, I have created XML::FoafKnows::FromvCard, which generates foaf:knows records from vCard address books. Finally, I have an RDF::Scutter module that can be used to gather FOAF data from the web.

Edit: JohnBreslin - see also Filster for incoming mail filtering using FOAF.
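To illustrate the kind of conversion mentioned above (vCard address book in, foaf:knows records out), here is a minimal Python sketch. It is not the XML::FoafKnows::FromvCard module and makes no claim about how that module works: the function names are hypothetical, the vCard parsing is deliberately naive (it ignores folded lines and grouped properties), and the output is Turtle with hashed mailboxes, which is an assumption made for the example.

```python
import hashlib

FOAF_PREFIX = "@prefix foaf: <http://xmlns.com/foaf/0.1/> ."


def mbox_sha1sum(address):
    """SHA1 hex digest of the mailto: URI, as used for foaf:mbox_sha1sum."""
    return hashlib.sha1(("mailto:" + address).encode("utf-8")).hexdigest()


def emails_from_vcard(vcard_text):
    """Collect the values of plain EMAIL properties (naive parser)."""
    emails = []
    for line in vcard_text.splitlines():
        name, sep, value = line.partition(":")
        # "EMAIL;TYPE=INTERNET" -> property name "EMAIL"
        if sep and name.split(";")[0].strip().upper() == "EMAIL":
            emails.append(value.strip())
    return emails


def foaf_knows_turtle(owner_email, vcard_text):
    """Emit Turtle saying that the owner knows each address-book contact."""
    parts = ['[] a foaf:Person',
             '    foaf:mbox_sha1sum "%s"' % mbox_sha1sum(owner_email)]
    for email in emails_from_vcard(vcard_text):
        parts.append('    foaf:knows [ a foaf:Person ; foaf:mbox_sha1sum "%s" ]'
                     % mbox_sha1sum(email))
    return FOAF_PREFIX + "\n\n" + " ;\n".join(parts) + " .\n"


ADDRESS_BOOK = """BEGIN:VCARD
VERSION:3.0
FN:Bob Example
EMAIL;TYPE=INTERNET:bob@example.org
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Carol Example
EMAIL:carol@example.org
END:VCARD
"""

print(foaf_knows_turtle("alice@example.org", ADDRESS_BOOK))
```

Publishing only the hashed mailboxes keeps the contacts' addresses out of the public data while still allowing a lookup by anyone who already knows the address.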

7. What contribution would you need from others?

I hope that the community can come together and create either a web service or a Perl module (the latter is sufficient for SpamAssassin and qpsmtpd) that can be queried with a SHA1 hash of the user's mailbox and a SHA1 hash of the sender's mailbox, and that returns a trust metric. We also need to come together to compile and host the data. We must expect that the trust metric computation will need to evolve over time. In particular, we must expect that spammers will try to make false foaf:knows assertions, and we need to think about how to deal with those without hindering adoption.
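The query interface described above might look something like the following Python sketch. Everything in it is an assumption made for illustration: the function names, the in-memory dictionary standing in for the compiled FOAF data, and above all the metric itself (the inverse of the shortest foaf:knows distance, cut off after a few hops), which is merely a placeholder for whatever metric the community settles on. Note that it takes every foaf:knows assertion at face value, so it does nothing about the false assertions mentioned above.

```python
import hashlib
from collections import deque


def mbox_sha1sum(address):
    """SHA1 hex digest of the mailto: URI, as used for foaf:mbox_sha1sum."""
    uri = address if address.startswith("mailto:") else "mailto:" + address
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


def trust_metric(knows, recipient_hash, sender_hash, max_hops=3):
    """Return a placeholder trust score between 0.0 and 1.0.

    knows maps a mailbox hash to the set of mailbox hashes that person
    claims to know. The score is 1 / (number of foaf:knows hops from
    recipient to sender), or 0.0 if the sender is not reachable within
    max_hops.
    """
    if recipient_hash == sender_hash:
        return 1.0
    seen = {recipient_hash}
    queue = deque([(recipient_hash, 0)])
    while queue:  # breadth-first search, so the first hit is the shortest path
        node, dist = queue.popleft()
        if dist == max_hops:
            continue
        for friend in knows.get(node, ()):
            if friend == sender_hash:
                return 1.0 / (dist + 1)
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return 0.0


# Toy data: alice knows bob, bob knows carol.
alice, bob, carol, mallory = (
    mbox_sha1sum(name + "@example.org")
    for name in ("alice", "bob", "carol", "mallory")
)
KNOWS = {alice: {bob}, bob: {carol}}

print(trust_metric(KNOWS, alice, bob))
print(trust_metric(KNOWS, alice, carol))
print(trust_metric(KNOWS, alice, mallory))
```

Because callers only ever pass hashes, the same interface would work whether it is backed by a local module or by a remote web service.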

8. What standardisation should the Semantic Web community at large undertake to support the project?

None is required at this point.

9. How does your project encourage others not currently involved with Semantic Web technologies to get involved (by providing data or make a coding commitment)?

Given the amount of FOAF data already out there, the system will be useful for many people from day 1. The resulting network effect would hopefully encourage other data providers to offer FOAF data as a service to their users. Notably, we know that LinkedIn developers have expressed interest in FOAF, and this could be the use case that makes the decision easy.

10. What would be the main benefit of using Semantic Web technologies to achieve the goals of the project, compared to other technologies?

The sheer amount of existing FOAF data. It is the only format or data model that expresses relationships between people in a standardised way that has really wide-spread adoption.

Commitments

If you like this project, please write your name below and indicate what contribution you can make to the project.

ChrisPrather -- I'm willing to help with coding implementations. This falls along a series of ideas I've been experimenting with already.

Kingsley Idehen -- OpenLink will provide, and host, RDF Data Sources etc. This includes experimentation with our MTA sink drivers (filters and storage drivers) for both Virtuoso and ODBC accessible Databases.

Tom Heath -- A very worthwhile project... I'm currently working on deriving trust metrics from Revyu.com data to support recommendations in social networks. The metrics assume foaf:knows relationships already exist between two Persons, and the purpose of the metrics is slightly orthogonal, but there may be common goals towards which we can collaborate, particularly in the generation of the metrics.

Dave Brondsema (from the Konfidi Project) -- First, we do have plans to support other identity and authentication systems besides OpenPGP. I will help align the Konfidi project with this initiative, in whatever way makes sense. I've also done a fair bit of reading of other people's research on computational trust modelling and can help in the discussions of how to model trust data and how to implement algorithms for trust computation.

Code

I've committed the first qpsmtpd plugin and the first SpamAssassin plugin to my SVN repository. I haven't tested them, but I think they conform to the respective APIs and should be functional, except that they don't have the trust metric yet.

Discussion

Discussion has begun on the foaf-dev mailing list.