
Access to Citation Data

On 14 May 2013 a group of other interested parties and I met to discuss the findings of a new Jisc-funded study that considers the case for open access to citation data.

Alex Ball | 30 May 2013

Back in 2012, Jisc commissioned a study of the costs, benefits and risks associated with collecting and analysing citation data. That study is now nearing completion and will be published very soon. As a precursor to that, on 14 May 2013 a group of other interested parties and I went along to the Jisc offices in London to discuss the findings and their implications.

The study has been conducted by Curtis+Cartwright, and Geoff Curtis was there to present the draft report. It identifies three use cases for analysing citation data:

  • identifying important papers and researchers in a field of study (for populating a literature review, or approaching potential collaborators);
  • managing the performance of researchers, departments or funding programmes;
  • providing business intelligence for, say, choosing journals to publish in, or evaluating research impact.

These use cases are satisfied (at least partly) by information on the papers citing or cited by a given paper, and metrics such as a journal's Impact Factor or a researcher's h-index. While the list of papers cited by a given paper is trivial to collect (it is simply that paper's reference list), the rest requires comprehensive oversight of the literature, which is where Abstracting and Indexing services come in.
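To make that asymmetry concrete, here is a minimal Python sketch. It assumes a toy corpus in which every paper's reference list is already known; the paper IDs and function names are invented for illustration and are not taken from the study. Inverting the reference lists gives the citing papers, and the h-index (the largest h such that h papers have at least h citations each) follows from the resulting counts. Both are only as good as the coverage of the corpus.

```python
from collections import defaultdict


def invert_references(references):
    """Build a 'who cites this paper' index from per-paper reference lists."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by[cited].add(citing)
    return cited_by


def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    h = 0
    ranked = sorted(citation_counts, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical toy corpus: paper ID -> IDs found in its reference list.
references = {
    "P1": [],
    "P2": ["P1"],
    "P3": ["P1", "P2"],
    "P4": ["P1", "P2", "P3"],
}

cited_by = invert_references(references)
# Citation count for each paper, in the same order as the corpus.
counts = [len(cited_by[paper]) for paper in references]
print(counts, h_index(counts))
```

Any citing paper missing from the corpus silently lowers the counts, which is why these metrics depend on an indexing service seeing as much of the literature as possible.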

The big players in this area are of course Thomson Reuters' Web of Knowledge, Elsevier's Scopus, Google Scholar and Microsoft Academic Search, with an honourable mention for CiteSeerX. You tend to get what you pay for with these services. Google Scholar and Microsoft Academic Search are populated with little human intervention and therefore give low-quality data; it is also unclear what their scope is, and how stable and sustainable they are. Web of Knowledge and Scopus are better curated, and therefore give better results, but their cost reflects that.

Does this mean everyone is catered for? This was considered carefully at the meeting and I won't claim consensus was reached. What was agreed is that it is very hard for new players to enter the market, because of the critical amount of data you need before analysis becomes worthwhile. For this reason, the study did not recommend spending public money on extracting citation data from existing literature, but instead on making citation data from new publications as visible and useful as possible. This means, for example, making papers open access and using ORCIDs to identify authors.

The hope is that doing this would make entry into the market easier in future, at least for some disciplines. A vision of what might be achieved was provided by David Shotton and the Open Citations Corpus. This resource now contains around 40 million references, enough to be able to automatically correct and deduplicate data. Indeed, this error correction function sounds like it is opening doors at data providers like CiteSeerX and CrossRef. For researchers, the advantages of an open corpus include the ability to correct mistakes in one's own data, an ability to semantically describe references (is the work being cited as a foundation stone of the current paper, or as an example of bad practice?) and freedom to use the data in novel applications.
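As a rough illustration of what deduplicating references involves, the Python sketch below groups reference records that share a normalised title and year. This is an assumption-laden toy, not the method used by the Open Citations Corpus: the record fields, the matching rule and the sample records are all made up for the example.

```python
import re
from collections import defaultdict


def reference_key(ref):
    """Crude matching key: lower-cased title without punctuation, plus year."""
    title = re.sub(r"[^a-z0-9 ]", "", ref["title"].lower())
    title = " ".join(title.split())  # collapse repeated whitespace
    return (title, ref["year"])


def deduplicate(refs):
    """Group records that share a key; each group is one presumed work."""
    groups = defaultdict(list)
    for ref in refs:
        groups[reference_key(ref)].append(ref)
    return list(groups.values())


# Made-up records showing the same work cited in two different styles.
refs = [
    {"title": "An Example Paper: Part One", "year": 2012},
    {"title": "an example paper - part one.", "year": 2012},
    {"title": "A Different Example Paper", "year": 2011},
]

groups = deduplicate(refs)
print(len(groups))
```

A real service would need fuzzier matching (author names, journal abbreviations, DOIs where present), and the more references a corpus holds, the more variants of each citation it has to compare against.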

In the discussion that followed, a few surprising facts came up. The commercial services do not, in fact, provide comprehensive coverage and have no ambitions to: there are quality advantages in being selective and it helps keep overheads down. Also, acquiring raw citation data accounts for only a fraction of the cost of providing such services. So it is not out of the question that we might see an open pool of citation data being formed, on top of which both free and commercial services are provided.

But if that does happen, it is some way off. The feeling in the room was that there are higher priorities to deal with, most especially in getting ORCID widely adopted, managing expectations of what citation data can be used for, and improving the quality of citations provided by authors.