Because good research needs good data

Credit from citing datasets?

Chris Rusbridge | 19 August 2008

Cameron Neylon has a thought-provoking post on his Science in the Open blog, arising from a discussion he had at Scifoo with Michael Eisen of Berkeley:

"Michael [...] felt people got too much credit for datasets already and that making them more widely citeable would actually devalue the contribution. The example he cited was genome sequences. This is a case where, for historical reasons, the publication of a dataset as a paper in a high ranking journal is considered appropriate. In a sense I agree with this case. The problem here is that for this specific case it is allowable to push a dataset sized peg into a paper sized hole. This has arguably led to an over valuing of the sequence data itself and an undervaluing of the science it enables. Small molecule crystallography is similar in some regards with the publication of crystal structures in paper form bulking out the publication lists of many scientists. There is a real sense in which having a publication stream for data, making the data itself directly citeable, would lead to a devaluation of these contributions. On the other hand it would lead to a situation where you would cite what you used, rather than the paper in which it was, perhaps peripherally described. I think more broadly that the publication of data will lead to greater efficiency in research generally and more diversity in the streams to which people can contribute."

This seems a strong argument to me. A paper describing a research dataset isn't really a research paper, surely? So it is a round peg in a square hole. But if the alternative is that the dataset can only be cited via a research paper (on some other, probably related topic) that mentions it in passing, then this is likely to be a rather poor proxy. The dataset creator may get less credit, and the research paper authors rather more, than either is due.

However, if we move to a situation where more datasets are cited directly, then not only do the dataset creators or providers get the credit that's due to them, but the credit is also of the right kind. That is, the citation is recognisably for creating or providing a dataset and not for a specific research contribution. I suppose that papers in the Nucleic Acids Research special issue on datasets are also recognisable as creation/provision papers, not research papers, but few other disciplines have such easily recognisable distinctions. So overall, simply using data citations looks like a better bet. As Cameron says:

"So to come back around to the original point, the value of different forms of contribution is not due to the fact that they are non-traditional or because of the medium per se, it is because they are different."

Amen to that!