IDCC13 Preview: Kaitlin Thaney

Magdalena Getler | 27 November 2012

The 8th International Digital Curation Conference is just around the corner and we are anticipating great discussions about data science when our international audience gather in Amsterdam in January 2013.

In the third of our series of preview posts, Kaitlin Thaney from Digital Science gives us her insights into some of the current issues...

Your presentation will focus on Infrastructure. Are there any specific messages you would like people to take away from your talk?

It's easy to think that we've worked out most of the kinks in research when we look at some of the latest advances in astronomy, genomics, and high-energy physics in the news, from the work at the LHC to the ENCODE project. But there are still a number of baseline assumptions in research that need rethinking - and in many cases, fixing. That's what Digital Science was created to address: some of the oft-overlooked roadblocks in areas like search in the sciences, information management, and the dated incentive system that is keeping us from fully updating our practices in the lab.

We address three areas in our call this year - Infrastructure, Intelligence and Innovation. What do you see as the most pressing challenges across these?

Having worked on infrastructure issues in research for the last six years, I'd say one of the main challenges remains making the right design decisions. Whether that's an open platform that operates on the backbone of the web or a lightweight software application for use in a research setting, design decisions are key, and in my experience they are often not thought through to the extent warranted.

There's a reason why inefficiency still exists in modern research labs, and it's not a shortage of tools. Part of that still lies in how systems are crafted for the individual user, but also in how they speak to other systems.

Also, the age-old incentive problem is still keeping us from reaching our full potential, as we continue to measure impact largely by papers produced. Not only does that skew researchers' incentives to better manage and make available, for instance, the data accompanying their research or the code needed to execute the experiment, but it also presents issues for other specialists whose main output may be software, not scholarly papers.

We need to rethink how we measure and reward research so that it better reflects a researcher's contribution to his or her community, and give the system a hard refresh.

And in terms of opportunities, do you see potential in data science as a new discipline?

Absolutely ... though it's not a "new" discipline, necessarily. There is an increasing understanding of the power of bringing together skillsets such as mathematics, machine learning, statistics, computer science and domain expertise (though the last is not always necessary), which is helping us redefine hypothesis-driven research as it becomes more data-driven.

What I find particularly fascinating is the spotlight it's putting on how we teach science undergraduates - making sure they not only have the practical skills for working in a lab or conducting an experiment, but also the statistical literacy and analytical reasoning to understand the information they're producing and collecting.

The conference theme recognises that the term ‘data’ can be applied to all manner of content. Do you also apply such a broad definition or are you less convinced that all data are equal?

I'm an equal opportunity data fan (and open purist, carried over from my time at Creative Commons). Too often, I feel, we get caught up in debates about the "worthiness" or "value" of particular data sets, a legacy from the publication world where only the most polished, interesting data counts. It's pervasive and keeping us from doing more robust, reproducible work. I am a strong proponent of not cutting oneself off from yet unknown opportunities, and unfortunately classifications such as "junk data" are not only increasingly silly in the digital age, but borderline harmful.

You’ll undoubtedly have looked at the programme in preparation for IDCC. Which speakers / sessions are you most looking forward to?

I'm looking forward to hearing Ewan Birney's keynote. I've long been a fan of his work, and I'm keen to hear his perspectives post-ENCODE.

Kaitlin's presentation, entitled Making Research More Efficient, is on Day 1 of the conference, 15 January. The full programme is available.

If you have not already done so, you can still book your place.

Please share your attendance at IDCC13 via Lanyrd.