Because good research needs good data

IDCC11 Preview: An interview with Victoria Stodden

Kirsty Pitkin | 25 November 2011

In the fourth in our series of preview posts ahead of IDCC 11, we interview Victoria Stodden, Assistant Professor in the Department of Statistics at Columbia University. She shares with us what she sees as the main stumbling blocks to open science and explains why she believes reproducibility of research is a key driver for openness...

You will be talking about reproducible research in your presentation at IDCC 11. What main outcomes are you hoping for from your talk?

I'm very happy to see themes such as open science and open data being discussed more frequently at conferences and workshops. What I hope to do with my talk is frame many of these different issues people are discussing within the context of the scientific method. I've found this to be the most powerful communication tool when trying to reach scientists and science supporters: how can we support scientific norms, particularly regarding computational research?

Reproducibility is a key part of the scientific method and provides the underlying rationale for openness in scientific knowledge.

Science isn't about finding answers: a diligent researcher in his or her basement figuring things out but telling no one isn't doing science. Science is about communicating both discovery and method.

Lots of people argue for open data but far fewer practice what they preach. What do you think is needed to encourage more data sharing?

Big question. Short answer: incentives. Scientists do lots of things they don't like in order to conform with the scientific method (who likes actually writing papers?) and we don't have the incentives in place to reward the full communication of method, which must include data and code sharing, such that published results can be conveniently reproduced by others in the field.

We are moving toward this though and I am encouraged, but scientists are subject to pressure from journal publication requirements, funding agency requirements, promotion and hiring committee expectations, the demands of their particular research problem, incentives to commercialize discoveries, and legal restrictions on sharing, among others.

Data and code sharing requires serious effort, and without incentives in place computational scientists face a collective action problem: why do something unrewarded at the expense of doing work that is rewarded? As you can see from the list, it is nearly impossible for all the incentive mechanisms to work together to rectify the problem, and there is no easy one-size-fits-all solution to implement.

So we are moving step by step and slowly as the pieces fall into place from the various groups, but we are moving.

Reproducibility is too fundamental to science for this not to be the case.

How do you think we can encourage 'unexpected' reuse of data?

I'd be wary of creating additional demands on those who are sharing data. As you can see from my answer to the previous question, there are plenty of difficult barriers already. One thing that can help, and that doesn't inhibit sharing, is research that shows the usefulness of data and code, both to innovation and to researchers' careers.

Karim Lakhani of Harvard Business School has a study showing that research problems posed by InnoCentive were typically solved by people in quite different fields. More studies showing increased citation from shared data and code will help break the myth that data sharing isn't personally advantageous to the sharer.

Making sure we cite appropriately for reused data and code will certainly help as well.

We usually see funders, data creators, universities and data users as the typical set of stakeholders for data. Would you add any to that list?

What about journals (e.g. the Elsevier Executable Paper Grand Challenge), governments (e.g. Data.gov), and the public?

Which group of stakeholders do you believe can do the most to promote a culture of wider reuse of data?

It's an interlocking effort. Each has a role to play and all can take steps to facilitate reproducibility in computational science.

What research data management tools do you think will be the ones to watch in the future?

How about project management and code development tools? One thing that is sometimes forgotten in the discussion of open science and open data is that code is inescapably and intrinsically part of the discussion. When you have data, you have code. There is no other way the dataset got there! And no other way to access it!

Some exciting management tools were discussed at a workshop I co-organized in the summer called Reproducible Research: Tools and Strategies for Scientific Computing. It is a mistake to think of data management in isolation, without regard to its interwoven role in a broader research context. This is where the framing of reproducibility can help.

If there was one change that you could make to improve research data management practice, what would it be?

Version control. It's necessary for all sorts of data sharing practices (and code sharing practices) and can be done on web interfaces, openly or privately until publication.

You may also be interested in previous interviews from this series with Ewan McIntosh, David Lynn and Mark Hahnel.

Victoria will be presenting a session entitled 'Reproducible Research' as part of the research perspectives strand of talks on Tuesday 6 December. You can still book your place at the 7th International Digital Curation Conference here. 

If you are unable to attend in person, look out for an announcement next week about how you can take part remotely, or track the conference via Lanyrd [http://lanyrd.com/2011/idcc11/] to be notified about the arrangements.