Wikiproteins...

Chris Rusbridge | 28 May 2008

Genome Biology has an article by Barend Mons, Michael Ashburner et al: "Calling on a million minds for community annotation in WikiProteins". From the abstract:

"WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a 'million minds' to annotate a 'million concepts' and to collect facts from the literature with the reward of collaborative knowledge discovery."

I'll say just a bit more on the Wikiproteins effort below, but I was also interested in this from the introduction:

"The exploding number of papers abstracted in PubMed [...] has prompted many attempts to capture information automatically from the literature and from primary data into a computer readable, unambiguous format. When done manually and by dedicated experts, this process is frequently referred to as 'curation'. The automated computational approach is broadly referred to as text mining."

I've been increasingly concerned recently to understand better the use of the word curation in this sense (eg 'curated databases' in genomics, etc), which dates back to at least 1993, preceding our use of the term by a decade. We try to cover this sense through the 'adding value' part of our definition ("Digital curation is maintaining and adding value to a trusted body of digital information for current and future use"), although I'm not sure it captures it fully.

Back at Wikiproteins, the idea is to combine the two approaches (manual curation by experts and sophisticated text mining). Jimmy Wales of the Wikimedia Foundation is one of the authors of the paper, which adds an interesting dimension. The approach is based on "a software component called Knowlets™. [...] Scientific publications contain many re-iterations of factual statements. The Knowlet records relationships between two concepts only once. The attributes and values of the relationships change based on multiple instances of factual statements (...), increasing co-occurrence (...) or associations (...). This approach results in a minimal growth of the 'concept space' as compared to the text space..."

This is extraordinarily interesting, and I'm sure we'll hear much more about it in the near future. I particularly like the approach to expert-based quality control. There must be questions about long-term sustainability, both organisationally and technically, but sceptics continue to be amazed at the sustainability of other kinds of Open activities!
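To make the quoted Knowlet idea a little more concrete, here is a minimal sketch of what "recording a relationship between two concepts only once" might look like. It is not the Knowlet software itself, whose internals the paper extract above does not spell out. The function name `build_concept_space`, the single `cooccurrences` attribute and the placeholder concept names are all assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_concept_space(sentences):
    """Keep one record per unordered concept pair (hypothetical helper).

    `sentences` is a list of lists, each holding the concepts that were
    recognised in one sentence. A repeated statement updates the existing
    record's count rather than adding a new record.
    """
    space = defaultdict(lambda: {"cooccurrences": 0})
    for concepts in sentences:
        # Sorting a de-duplicated set means (A, B) and (B, A) share one key,
        # and a concept mentioned twice in a sentence is counted once.
        for pair in combinations(sorted(set(concepts)), 2):
            space[pair]["cooccurrences"] += 1
    return dict(space)


# Placeholder concepts, not real annotations.
sentences = [
    ["protein A", "disease X"],
    ["disease X", "protein A"],  # the same statement, repeated elsewhere
    ["protein A", "pathway Y"],
]
space = build_concept_space(sentences)
for pair, attributes in sorted(space.items()):
    print(pair, attributes)
```

Three sentences here yield only two records, which is the sense in which the 'concept space' grows more slowly than the text space: repetition in the literature changes the attributes of a relationship rather than the number of relationships.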
