IDCC11: Open data driving scholarly communications in 2020

13 December, 2011
Kirsty Pitkin

Professor Philip E Bourne from the Department of Pharmacology at University of California San Diego and Editor-in-Chief of PLoS Computational Biology gave a remote presentation to the conference via Skype to outline what he sees as the future for data publishing.

Bourne opened his presentation by explaining his perspective as a data producer and a data user in a field that has emerged largely as a result of open data. He also warned us that he is suspicious of institutional repositories and supportive of data sharing.

He began with a story that really motivated him: the story of Meredith, who submitted a paper on pandemic modelling proposing a very different theory to the general consensus in the field. The paper was eventually submitted to Science, where it was sent out for review. What makes this unusual is that Meredith was just 15 years old. She had become interested in the subject as a result of a science fair project and began to dig in, but got the data the old-fashioned way: by reading the papers and bugging the authors for the data. Bourne asked us to imagine what someone like Meredith could do if the data were openly discoverable.

With this in mind as a driver, Bourne gave us his views from his various perspectives, beginning with that of a data producer. He observed that the volume of data is increasing faster than Moore's Law, which presents a problem. He stressed that we really need to start thinking about cost versus benefit, and to question what data we can afford not to keep. We have to consider reducing the data we retain.

The other aspect of data production he felt was under-represented was the idea of the long tail. He noted that in the US a lot of effort goes into worrying about large data, but little attention is paid to the long tail, including the large number of smaller datasets, which are not persistent and typically get lost when people leave.

Bourne moved on to discuss curators and the efforts the PLoS Computational Biology journal is making to pay homage to those who are dedicated to making our research endeavours successful by curating the data. He feels that these are people who need to do more to promote themselves and to make those who use their services more aware of what they are doing.

There needs to be more synergy between data and publication. He used the example of their IDB programme, where a talented group of immunologists read the literature and manually enter information into a database, taking the data from digital to analogue to digital form. It is clear that we need structured publication, so that extraction of data from the literature can happen automatically, without manual curation.

Bourne described the process by which they accept data into a database at the Protein Databank: they issue an identifier for the data, which then allows the author to publish in journals, thus providing the incentive. However, once the researcher has the identifier, it is difficult to get them to go back and improve the dataset if it is not perfect. He also discussed the problem of data which is “too perfect”, which can pass through the system but is later found to be fraudulent; they are now putting systems in place to deal with this. He discussed the parallels between the databank's processes and the process of publication in a journal, observing that the differences really lie in how we perceive the final product.

Next he provided his perspective as a database provider working with the Protein Databank, which is one of the oldest databases in biology and considered a community-owned resource. Like many of these resources it has grown and grown in complexity, but at an effectively decreasing cost, thanks to improved data processing. However, funders place a lot of emphasis on usage metrics, and Bourne worries that an equivalent of an H-factor may emerge, where researchers cannot get funding unless they can show high usage metrics for their data.

Bourne also discussed some of their mistakes, including failing to impose an ontology and metadata standards on the community early enough, and failing to develop software tools to facilitate this early on. This has come back to haunt the community as a whole as the data has become more complex. They were early adopters in assigning DOIs to data, but these are not used in the literature, so efforts are still ongoing to educate people to use them.

Bourne has realised that at the moment they maintain structure data, whilst others maintain other forms of biological data; the end result is that data tends to be stove-piped around certain types and topic areas. This makes it difficult to address broad biological questions with today's resources.

In terms of privacy, Bourne noted that there is an interrelationship between a piece of data and an IP address, allowing them to see which users look at each dataset. This means they could have an Amazon-type model of suggesting recommended datasets based on previous views. Users are hankering after more performance. However, the use of widgets to pull in data from different sources has been slow to take off.

Bourne discussed the dream of literature integration, which is a testament to full open access. He described work to find database identifiers within texts to look at the context where that data appeared and make associations with other similar data for the user. This also allows the creator of the dataset to see how their work is being used in ways that were previously unavailable. 

Bourne draws from this a dream for what should happen in the future, where the paper becomes only one view of a piece of knowledge. He suggested a future where mousing over thumbnails in a paper allows you to pull in the source data. If appropriate metadata were deposited with the paper itself, you could regenerate the figure from the retrieved data in the same way as it was available to the author. You could also manipulate the data with other tools to do things the author never thought about, annotate that view and store it. This could then trigger you to look at more mashups of the data and drive you to other papers. This dispels the notion that data and knowledge are separate and creates a situation where the paper is one view of the data.

Bourne then moved on to discuss how he feels about data as a user. He believes it is great that we are thinking about data as having more value, but data repositories are broken. He does not believe that the notion of “build it and they will come” works in science, but this is the notion upon which many institutional repositories have been created. He feels that they are not particularly effective. He wants to be able to search broadly and deeply into a series of repositories, which he cannot currently do with institutional repositories.

Bourne also discussed the “High Noon” effect, drawn from the days of VCRs, when owners often did not know how to set the clock on their devices, let alone how to use the features properly. He stressed the need for very simple interfaces. The journal article interface is fairly common, although he conceded that getting information in is not always that straightforward. Data repositories are worse because they have different procedures and different interfaces. Most repositories strive to be different from each other, which makes usage – and deep usage in particular – difficult.

He stressed that journals looking to make greater use of services like Dryad is a good thing, but really only a stop-gap measure. We need seamless access between the data and the journal article.

Bourne touched on the issue of rewards and incentives, describing their work with the PLoS topic pages to get around the traditional reward mechanisms and create something new. He observed that most scientists don't author for Wikipedia because there is no reward. PLoS gets authors to write a wiki page, which is published in the journal so they get that traditional reward, and this becomes the copy of record; it is then deposited into Wikipedia, which becomes the living version of that content. Their ideas for data pages will be broadly the same.

Bourne concluded by looking forward to 2020. We will want resources that allow us to answer questions rather than just retrieve data. We need to understand all there is to know about that data, and we need the notion of a data registry. He also suggested that to get there we need the notion of an app store for data.