
Researcher perspectives at #idcc11


Kirsty Pitkin | 07 December 2011

In this session we heard from Mark Hahnel, Victoria Stodden and Heather Piwowar, who each provided a different angle on the role of the researcher in opening up their data.

Give Researchers Credit for All of Their Research
Mark Hahnel, FigShare, Digital Science
 
Hahnel used his presentation to introduce us to FigShare, which he created in response to a number of niggling issues during his PhD, including what to do with his research data objects and where to host objects, such as video, in a way that could be reliably cited in his thesis.
 
As a scientist, Hahnel recognises that he is an egomaniac and emphasised that researchers need to get an ego boost for their research. FigShare is a place to put all of his research data and get credit for it. Scientists are already out there finding ways to boast about their research. He discussed tools like the Scholarometer, which shows who is doing well in an individual subject area and provides an embeddable widget allowing the researcher to show that they are number one in their niche. Hahnel examined these tools when creating FigShare, and sought to bring in both citations and other alternative metrics to help provide the researcher with an impact score across all of their outputs, not just their papers.
 
Hahnel observed that tools such as YouTube and Flickr are convenient places to host research objects, but they do not produce persistent identifiers, which makes the objects difficult to cite. However, people put up with this because it gives them the opportunity to share their research. This means that useful science is just getting lost in YouTube. A lot of data also gets wasted because it doesn't fit into the story researchers are telling with a particular research paper, even though it may be valuable if shared.
 
FigShare allows you to upload your research objects with a persistent identifier so nothing gets wasted because it does not fit into a paper. A new version will be available from January, but Hahnel was able to provide a demonstration of the system in his presentation, including how to add tags, add context, and how to keep an object private or publish it openly with a persistent identifier to use in a paper. The service will also provide metrics and allow you to sort and organise your research. He concluded by stressing that the more metrics you can add, the easier it is to boast about your research data.
 
 
Reproducible Research
Victoria Stodden, Assistant Professor, Department of Statistics, Columbia University
 
Stodden observed that we are at a watershed point when science turns its eye to reproducibility and data replication.
 
The concept of reproducible research can frame the agenda for digital curation. Stodden explained this by asking a question: “Why is science open?”  The main purpose of publishing materials in the scientific method is to root out error. She observed that in our digitised frenzy we seem to have forgotten this. You see implementations, rather than a sceptical dialogue about rooting out error.
 
She moved on to discuss how reproducibility allows you to talk about the code in the same discussions as the data. Without code, you don't have data. The results that are being generated would not be there without the code. The code allows us to journey through the scientific discoveries made in the data. Many deep intellectual contributions to science are only made through the code.
 
However, she observed that in computational science it is almost impossible to capture every step that you went through to make the discovery, and the code is essentially hidden at the moment. We need independent replication to verify results and we need open code to reconcile differences in implementations in that replication.
 
Stodden noted that we have a credibility crisis, with many more computational papers being published, but only around 21% containing documentation or information about the software packages used to yield the results, and therefore rendering those results unverifiable.
 
Stodden observed that many scientists are not sure about what to share or when to share it, whereas the concept of reproducibility helps to make this clear. Open data and open science do not have a concrete meaning to the everyday practice of many scientists, whereas reproducibility does.
 
She concluded by stressing that data and code should be open, and long-term access assured, to make computational science truly a science.
 
 
Data Use Attribution and Impact Tracking
Heather Piwowar, DataONE post doc with NESCent
 
Piwowar opened her fast-paced presentation with Newton's comment: “If I have seen a little further, it is by standing on the shoulders of giants.”  She observed that by making data open we are effectively trying to make our shoulders broader, but this is painful and we worry that we are actually helping others to get ahead, as our culture is built upon becoming the top dog.
 
Piwowar stressed that researchers already think that they will get more citations by sharing their research data, but they are often not sure that their funders and their promotions committee are going to care. We need to facilitate deep recognition of the labour of dataset creation.
 
She argued that the academic CV could be the place to focus on that recognition by making it not about what you did but what difference it made. She demonstrated the Total Impact.org tool, which resulted from a recent hackathon event. This allows you to list research objects (both traditional publications and non-traditional objects such as data set DOIs, Slideshare URLs and so on).  The tool then collects open data about those materials, wherever they are hosted, via APIs to measure how those research objects made an impact and display this on the CV.
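
As a rough illustration of the kind of aggregation described above, the following Python sketch collects metric counts for a list of research objects from several sources and merges them into one report per object. It is not the Total Impact implementation: the provider functions, the identifiers and the numbers are all hypothetical stand-ins for real API calls, and are assumptions made purely for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A provider takes a research object identifier and returns metric counts.
# In a real tool each provider would wrap a call to an external API;
# here they return canned, made-up values so the sketch runs offline.
Provider = Callable[[str], Dict[str, int]]


def fake_citation_provider(identifier: str) -> Dict[str, int]:
    """Hypothetical stand-in for a citation data source."""
    canned = {"doi:10.0000/example-dataset": {"citations": 3}}
    return canned.get(identifier, {})


def fake_social_provider(identifier: str) -> Dict[str, int]:
    """Hypothetical stand-in for bookmark and view counts."""
    canned = {
        "doi:10.0000/example-dataset": {"bookmarks": 5},
        "https://example.org/slides/1": {"views": 120},
    }
    return canned.get(identifier, {})


def collect_impact(
    identifiers: List[str], providers: List[Provider]
) -> Dict[str, Dict[str, int]]:
    """Merge the metrics every provider reports for each research object."""
    report: Dict[str, Dict[str, int]] = {}
    for identifier in identifiers:
        totals: Dict[str, int] = defaultdict(int)
        for provider in providers:
            for metric, count in provider(identifier).items():
                totals[metric] += count
        report[identifier] = dict(totals)
    return report


if __name__ == "__main__":
    objects = ["doi:10.0000/example-dataset", "https://example.org/slides/1"]
    sources = [fake_citation_provider, fake_social_provider]
    for obj, metrics in collect_impact(objects, sources).items():
        print(obj, metrics)
```

Keeping each source behind the same small function signature means a new source of open metrics can be added without changing the aggregation step, which fits the "try a lot of things" approach Piwowar advocates.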
 
Piwowar discussed the data available to repositories and the stories that they want to be able to tell about their datasets.  Repositories want to make the case that they are a good intellectual and scholarly investment. They want to be indispensable.  Funders want to be making an impact and changing the world and researchers want to be part of that, which means they need to be demonstrating their individual impact.
 
She stressed that we don't know the right approach yet, so we need to try a lot of things. For this we need a bedrock of information, including open access to citation data. There is a place for the main players such as Google Scholar, Thomson Reuters Web of Science and Elsevier to work on the improvements we need, but she argued that there is also a place for quick and dirty solutions.  However, our citation data is not open enough to enable this.  She also drew the distinction that open access to citation data is one thing, but we also need open access to the full text so we know how that citation actually affected the paper in question and the role it plays in the research.
 
Piwowar closed with three calls to action:
 
  1. We need to raise our expectations about what can and should be mashed up.
  2. We need to raise our voices.
  3. We need to get excited and make things.

Piwowar believes that these three actions will help us to build broad shoulders that can be tracked because they are visible, and will lead us to a future where data attribution counts. She concluded by emphasising that we need not only open data, but open data about our data.