Big Data, Big Deal?

Marieke Guy | 10 May 2012

The provocatively titled Eduserv symposium 2012: Big Data, Big Deal? provided a forum for IT professionals, and anyone responsible for managing research data or planning to work with big data in Higher Education, to discuss the meaning of big data and the challenges it presents. Speakers from both the commercial and academic worlds came together to reflect on big data trends and their implications for research, learning and operations in HE.

Big data is generally taken to mean data sets that have grown so large and complex that they are difficult to work with using traditional database management tools. The key factors are seen to be the “volume, velocity and variety” of the data (Edd Dumbill, O'Reilly Radar).

Some key themes emerged during the day:

We don’t need to get hung up on the ‘big’ word. While data is increasing exponentially (something a number of scary graphs indicated), this doesn’t have to be an issue: we are getting more used to dealing with large-scale data. While the Large Hadron Collider produces around 15 petabytes of data annually, ecology engineer Simon Metson from the University of Bristol/Cloudant talked about 50 terabyte datasets. In his lightning talk Simon Hodson, Programme Manager at JISC, reported a quick straw poll of two Russell Group universities. Both believed they held around 2 petabytes of managed and unmanaged data, yet while one currently provides 800 terabytes of storage the other provides only 300 terabytes, and both were concerned that their storage could be full within the next 12 months. Storage costs, however, are decreasing, and storage models are changing (often towards cloud computing). Guy Coates from the Wellcome Trust Sanger Institute explained that the cost of genome sequencing halves every 12 months and that this trend is continuing. It is more than likely that the $1000 genome will arrive in the next 12 months, and people will soon be purchasing their own USB-stick genome sequencers!
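
As a rough back-of-envelope illustration of the storage gap in that straw poll, the short Python sketch below simply restates the figures quoted above (roughly 2 petabytes held against 800 and 300 terabytes of provision); the binary terabytes-per-petabyte conversion is an assumption made purely for illustration.

    # Back-of-envelope sketch of the storage gap from the straw poll above.
    # Figures are those quoted in the talk; the TB-per-PB conversion (binary)
    # is an illustrative assumption.
    TB_PER_PB = 1024

    data_held_tb = 2 * TB_PER_PB        # ~2 PB of managed and unmanaged data
    storage_provided_tb = [800, 300]    # current provision at the two universities

    for provided in storage_provided_tb:
        coverage = provided / data_held_tb
        shortfall = data_held_tb - provided
        print(f"{provided} TB of provision covers {coverage:.0%} of holdings "
              f"(shortfall roughly {shortfall} TB)")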

The tools are now available. During the symposium speakers mentioned tools such as Hadoop, CouchDB and other NoSQL databases, which make it far easier to work with large data sets. There was consensus that people no longer need to build their own systems to deal with big data, and can instead spend that time understanding their data problem better. Graham Pryor from the DCC saw the data problem as being partly about how you get researchers to build planning into the research data management process; these issues are central to effective data management irrespective of scale.
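
For readers who have not used these tools, the core idea they scale out is the map/reduce pattern. The fragment below is a minimal, pure-Python sketch of that pattern, not Hadoop or CouchDB code; the sample records and function names are invented for the illustration.

    # Minimal pure-Python sketch of the map/reduce pattern that tools such as
    # Hadoop distribute across a cluster. Sample records are invented.
    from collections import defaultdict

    records = [
        "big data big deal",
        "data analysis needs people",
        "big data needs data scientists",
    ]

    def map_phase(record):
        # Emit (key, 1) pairs -- here, one pair per word.
        for word in record.split():
            yield word, 1

    def reduce_phase(pairs):
        # Sum the counts for each key.
        totals = defaultdict(int)
        for key, count in pairs:
            totals[key] += count
        return dict(totals)

    pairs = (pair for record in records for pair in map_phase(record))
    print(reduce_phase(pairs))   # e.g. {'big': 3, 'data': 4, ...}

On a single machine this is trivial, which is rather the point made above: the hard part is no longer building the machinery, but understanding what questions the data should answer.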

It’s all about the analysis of data. While storage of data can be costly and management of data labour intensive, the analysis of data is often the most complex activity. Keynote speaker Rob Anderson from EMC explained that “if we’d been able to analyse big data we might have been able to avoid the last financial crash”. He sees the future as being about making big-data-based decisions and unlocking value by making information transparent and usable at a higher frequency. However, while tools have a role to play here, analysis still requires human intervention. On his blog, Adam Cooper from CETIS advocates human decisions supported by good tools that provide data-derived insights, rather than “data-driven decisions”. During the symposium Anthony J. Brookes, professor of Genomics and Informatics at the University of Leicester, gave an overview of the disastrous divide between research and healthcare (i.e. the divide in how data is managed, for example through the use of different standards) and the need for knowledge engineering (analysis) to bridge the gap. Talking about data as a way of life for public servants, Max Wind-Cowie from the Progressive Conservatism Project at Demos explained that many public sector brands have become toxic, and that big data can help us to better understand how that happened. Devin Gaffney from the Oxford Internet Institute presented a number of interesting case studies showing why prescribed analytics often fail to deliver.

We don’t yet know what data to get rid of. Anthony Joseph, professor at the University of California, Berkeley, suggested that the selection and deletion of data is the most intractable problem of big data. He pointed out that if you “delete the right data no-one says thank you, but if you delete the wrong data you have to stand up and testify”, giving the US climate trial as an example. We often find it difficult to be selective when curating data because we don’t yet know the questions we will need to answer.

We need data scientists. Many of the talks highlighted the need to build capacity in this area and to train data scientists. JISC is considering changes to the research data scientist role in its programmes, and Anthony Joseph asked HEIs to consider offering a big data curriculum. In his summary, Andy Powell, Eduserv’s Research Programme Director, asked us to think carefully about how we use the term ‘data scientist’, as the label can be confusing. He noted that there is a difference between managing data, an activity we at the DCC are fairly familiar with, and understanding and analysing data.

Image entitled Flu Genome Data Visualizer by Jer Thorp.