Publishing Open Data Working Group

Alex Ball | 25 June 2011

Back in September 2010, BioMed Central (BMC) issued a draft position statement on open data. This statement set out the benefits of open data and discussed the practicalities of making it happen: selecting data for release, when and where to release it, and how to license it. Since then, the argument for open data and reproducible research has been gaining traction, but there has been comparatively little movement by publishers to put it into practice.

BMC therefore convened a working group of authors, editors, publishers and funders to discuss three important issues that, if settled, would enable significant progress to be made towards open data in the life sciences.

  1. What would be the best editorial policy and process for clarifying the IP status of data, encouraging release into the public domain while acknowledging that licenses may sometimes be needed?
  2. What does publishing data alongside papers mean for peer review?
  3. What are the potential pitfalls to avoid when implementing this policy and process?

The working group met for the first time on 17 June. There were 19 of us there in total, including two from the DCC. I think it would be fair to say that we were all enthusiastic about the possibilities for open data, but well aware of the challenges facing our constituencies in making it happen. This was immediately clear as we tried to answer the first question.

The line of reasoning was essentially this. The ideal scenario would be for a paper and its underlying data to be published simultaneously, with the paper retaining its normal copyright status but the data explicitly marked as public domain. This would apply not only to data published in separate files, but also to data that could be extracted from the paper itself. It would therefore be good for this to be standard editorial policy.

There are, however, barriers to this. Some datasets cannot be shared at all, for reasons of privacy, confidentiality and data protection. In some disciplines, the data sharing culture just isn't there. At the very least, then, journals would have to be prepared to waive the public-domain requirement for data. The question is, should journals demand a good reason for waiving the requirement (cue deliberation over what the good reasons are), or simply waive it on request? And if the latter, would that make a nonsense of the requirement? Or could the mere fact of having the requirement provide the impetus for cultural change? I can't speak for everyone there, but I found myself optimistic that it could.

One of the concerns that editors and publishers had about supplementary data was putting them through peer review. This concern played a part in the decision taken by the Journal of Neuroscience in 2010 to stop publishing supplementary data on its Website. The problem is that readers typically expect peer-reviewed data to be free from errors, but reviewers do not have the time to review data in that much detail. Even where journals engage statistical specialists to review papers, those specialists check the validity of the reported statistical methods but not the actual data.

While the Journal of Neuroscience route is one solution to the problem, a less extreme approach would be to manage expectations a little better. Even if a full review of the data is impossible as part of the publication process, journals could still perform basic checks: that the data are actually in place, that the column headings make sense, that the formats are not obscure, and so on. So long as authors, reviewers and readers alike are aware of the standards set for peer-reviewed data, there should be no problem with providing this reduced service.

The discussion of the third issue – best practice and lessons learned with regard to implementing data sharing policies – was wide-ranging and informative. Among the topics discussed were how an editorial policy on data sharing might be enforced, and whether requiring authors to include a data sharing statement in their papers might foster a culture of compliance.

The day's discussions threw up quite a few other important points about publishing data. Among the issues that piqued my interest were the following.

  • How do you define what counts as data? Matt Cockerill, Managing Director of BMC, had an intriguing suggestion: that the defining quality of data is that they can be combined, as opposed to merely aggregated.
  • One of the reasons authors might be wary of dedicating their data to the public domain is that this removes their legal right of attribution. It does not mean, though, that community norms with regard to attribution and plagiarism cease to apply. There is perhaps a role for publishers in reminding authors of this.
  • Are journal Websites the best place to publish data? We certainly did not come up with a straightforward answer to this one. My personal take on the issue is that they are as good a place as any in the short term, and certainly better than relying on authors' personal storage, but if data are likely to remain relevant far into the future, they would be better placed in a specialist data centre.

All in all, I was impressed with the positivity and passion of the working group for bringing about a common culture of open data. I will be greatly interested to see the effect it has on journal policies in the coming months and years.