IDCC13: What’s in a name? The ‘data scientist’ symposium

21 January, 2013

The IDCC symposium this year was on ‘What is a data scientist?’ After some shuffling of chairs, Liz Lyon opened proceedings with a quick show of hands on who thought they were a data engineer, data analyst, data librarian, data steward, data journalist or data publisher. Data librarian and data steward got the biggest response, but would these people call themselves data scientists? Or are there other terms they would use?

Four members of the panel gave their perspectives on the role of a data scientist. First up was Stephanie Wright from the University of Washington Libraries. She’s found that the term ‘science’ can be a stumbling block; it has such strong connotations that it’s hard to convince people that you’re discipline-agnostic. She prefers the term ‘data concierge’ as you’re responding to all manner of requests, often achieving the impossible. Stephanie feels that the most important facets of the job are the ones not learned at library school, namely project management skills, persistence, and an ability to speak various research dialects.

Louise Corti of the UK Data Archive spoke next. She classed data scientists as people who can understand, analyse and transform data. In her experience they are usually males who have done early career research, are highly intelligent, very technically competent and relish being called ‘data geeks.’ A similar focus on data analysis was put forward by Scott Edmunds of GigaScience and Francine Bennett of Mastodon C. Scott said that if “data is the new oil” we should use it as such, collating huge datasets to mine and exploit. Francine meanwhile described data scientists as doing something magical by drawing new insights from what can seem like a disorganised heap. Rather than skills or formal training, Francine focused on the characteristics needed: curiosity; humility and hubris; and pragmatism.

Mark Hahnel of FigShare kicked off the discussion by asking whether he was the only one who defined a data scientist as someone who analyses, mines and visualises the data. Stephanie questioned whether the role is just the analysis - does it not also incorporate data management skills and supporting researchers? Louise described the family of roles at UKDA, where they have data managers as well as data scientists. So are these hybrid roles or two separate but related skillsets - one about data curation and another about data analysis?

Sheila Corrall of the University of Pittsburgh questioned the representativeness of job titles, relaying how the term information scientist got hijacked and lost its meaning. William Kilbride of the DPC echoed these sentiments, describing how job titles come into vogue and pass. Around 2002 lots of the job applicants he interviewed were ‘knowledge managers.’ He always asked what the person understood by the term and what they actually did. To a large extent job titles are irrelevant - what matters is people’s experience, skills and characteristics.

And to reach a conclusion we need to focus on those skills and characteristics. It’s all too easy to get sidetracked by a discussion on job titles with inevitable differences of opinion and interpretation. Liz concluded that we could start to define the skillsets and training needed. Interestingly, the topic of data analysis had already been raised in the training workshop on Monday. When asked how data analysis fitted into the curriculum for the OpenExeter and Research360 training courses, both teams responded that they’d intentionally left this out. Their courses focused on data management - data analysis would be taught by each discipline separately in whatever research skills and methods courses they run locally.

So, what is a data scientist? Consensus seemed to be that it’s someone who approaches data with a scientific mind, analysing it to unlock its stories. This is different to data management but closely allied. And I think we should be making those alliances stronger. If we can’t tell the stories of our data, we can’t make such persuasive arguments to invest in data management and sharing.