IDCC13 Preview: Kevin Ashley

Magdalena Getler | 30 October 2012

The 8th International Digital Curation Conference is just around the corner and we are anticipating great discussions about data science when our international audience gather in Amsterdam in January 2013.

In the first of our series of preview posts, DCC's Director, Kevin Ashley, gives us his insights into some of the current issues...

You are a conference co-chair. Are there any specific messages you would like people to take away from the conference?

I think the conference has multiple messages to offer to what is a diverse audience, professionally and geographically. Overall, I would like everyone to come away aware of the potential for reuse of the work that others are doing and the potential for collaboration. Whether it is software tools, training materials, methodologies or analyses, many of the talks describe things that others can use to deal with data curation issues in their own research group, institution or national setting.

There are already signs of worrying duplication of effort in the digital curation field and this is something we can't afford. Internationally we will struggle to command the resources to solve these problems once. We can't afford to solve some of them twice or three times and others not at all.

We address three areas in our call this year - Infrastructure, Intelligence and Innovation. What do you see as the most pressing challenges across these?

I would not want to single any one out. We have immediate and pressing challenges in the area of infrastructure, particularly because its effective use will be key to realising necessary efficiencies at a time when money may be harder to come by.

We don't lack for either intelligence or innovation in this field, but we need to work harder to coordinate and build on the innovation and to encourage greater adoption of some of the techniques described under the 'intelligence' heading in the call for papers.

And in terms of opportunities, do you see potential in data science as a new discipline?

I'm not at all convinced that it is a new discipline. I think people have been doing what we call data science for many years, albeit under different names and without the ease that increasing compute power and growing data collections offer. That doesn't mean that there isn't potential.

I think one goal that's within reach is to enable those who currently think of themselves as domain specialists who happen to deal with data to realise their potential as generalists, able to apply their skills in data analysis and synthesis across many disciplines. It's an approach we've used in much of the training developed for the DCC's DC101 course, for instance.

Matters such as data quality can be taught in a generic way and only then does one need to consider how to apply them in specific research domains. Many data scientists learn their skills on the job in a way that can make them believe that their skills aren't transferable; they usually are.

The conference theme recognises that the term ‘data’ can be applied to all manner of content. Do you also apply such a broad definition or are you less convinced that all data are equal?

I have always been an advocate of the broadest possible definition of data. Even in the relatively constrained area of relational databases with fields and rows, I've always been at pains to point out that the cells don't just contain numbers or short text strings, but might contain audio, video, rich documents or a variety of other types of content.

This was important for the design of systems like NDAD, the initial service for preserving and providing access to UK government data for the Public Record Office which we were building in 1997. It was also important to communicate this to archivists and records managers who were making selection decisions about what would be preserved. By encouraging them to take a broad view of what 'structured data' was, we acquired material that might otherwise have been lost.

The NSF/JISC/NEH/NWO/ESRC/SSHRC/AHRC/IMLS-funded 'Digging Into Data' challenges have also done a great deal to encourage a broad view of what data can be. That doesn't mean that all data are equal. What you can do with a data collection depends a lot on the amount of structure it has and on many other properties of it. But there isn't a simple hierarchy of good and bad data; quality, and even the 'data-ness', of something is in the eye of the beholder. I can read a novel as a novel for its enjoyment. You can take the same text and use analytical techniques to make deductions about authorship and style. It's the same content, but it is only data to one of us.

You’ll undoubtedly have looked at the programme in preparation for IDCC. Which speakers / sessions are you most looking forward to?

As a chair, it would be unseemly for me to pick out particular submissions – we value them all! But I'm glad that we've got a greater percentage of talks from speakers across Europe this year, which was one of the reasons for holding the conference in Amsterdam.

We knew there was lots of innovative work taking place across the continent that wasn't getting the attention it should have done at IDCC in past years. I'll be catching as many of these talks as I can.

I'm also looking forward to two workshops on Monday and Thursday that I'll be personally involved in. The first is to promote awareness of and European participation in the fledgling Research Data Alliance, and the second will try to develop a common understanding of pricing (as opposed to costing) schemes for data repositories.

If you have not already done so, you can still book your place.

Please share your attendance at IDCC13 via Lanyrd