Because good research needs good data

Oily big data at Online Information 2012

A summary of the Online Information Conference, held 20th–21st November at the Victoria Park Plaza Hotel. Presentations were clustered around five key topics which ran in three parallel tracks: the multi-platform, multi-device world; social collaboration; big data; search; and new frontiers...

Marieke Guy | 26 November 2012

Online Information is an interesting conference as it brings together information professionals from both the public and the private sector. This year's event, held on 20th and 21st November, felt slightly different to previous years. The conference had condensed from three days to two, dropped its exhibition and free workshops, and found a new home at the Victoria Park Plaza Hotel. The changes resulted in a leaner, slicker, more focussed affair with even more delegates than last year.

There was much to interest those involved in Research Data Management. Presentations were clustered around five key topics which ran in three parallel tracks: the multi-platform, multi-device world; social collaboration; big data; search; and new frontiers in information management. The big data sessions in particular offered insight into future trends in both the academic and commercial sectors. By the end of the two days I'd heard every oil analogy imaginable (data is the new oil; big data, the crude oil of our time; information is the oil of the 21st century; and analytics is the combustion engine)!

I’ve cherry-picked the most relevant sessions I attended.

Making sense of big data

Mark Whitehorn, Chair of Analytics, School of Computing, University of Dundee, gave a very different overview of big data to those I've seen before. He explained that data has always existed in two flavours: tabular data, the stuff that fits into relational databases (rows and columns), and everything else: images, music, Word documents, sensor data, web logs, tweets. If you can't handle your data in a database then it's big data, but big data doesn't necessarily have to be huge in volume. You can put big data into a table but you probably won't want to; each class of big data usually requires a hand-crafted solution.

Whitehorn explained that big data has been around for over 30 years, since computers first came about, but that we've focussed on relational databases because they are so much easier. Two factors have changed recently: the rise of machine-generated data and an increase in computational power. Whitehorn gave the example of the Human Genome Project, completed in 2003. Scientists haven't been able to fully exploit that knowledge yet because there are 20,000–25,000 genes in the human genome, each associated with different proteins. This data is bigger and more complex than tabular data, and it is going to need to be analysed. Whitehorn noted that for most organisations it is probably easier to buy algorithms from companies (for example, to mine your Twitter data) than it is to develop them. He suggested that we choose our battles and look for areas where real competitive advantage can be gained. Sometimes it is in the garbage data where this can really be achieved: Microsoft spent millions on developing a search spell checker, while Google just looked at misspellings in its query logs and where people went afterwards.
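That query-log trick is easy to picture in code. The sketch below is purely illustrative, not Google's actual pipeline, and the log data is invented: it simply counts how often users reformulate one query into another and suggests the most common reformulation as a correction.

```python
# Illustrative sketch of correction-by-reformulation: if users who type
# "recieve" commonly reissue the query as "receive", the reformulation
# itself is the spelling signal. The log data here is invented.
from collections import Counter, defaultdict

# Hypothetical (query, follow-up query) pairs from user sessions.
query_log = [
    ("recieve", "receive"),
    ("recieve", "receive"),
    ("recieve", "retrieve"),
    ("definately", "definitely"),
    ("definately", "definitely"),
]

reformulations = defaultdict(Counter)
for query, follow_up in query_log:
    if query != follow_up:
        reformulations[query][follow_up] += 1

def suggest(query: str) -> str | None:
    """Return the most common reformulation of a query, if any."""
    counts = reformulations.get(query)
    if not counts:
        return None
    candidate, _ = counts.most_common(1)[0]
    return candidate

print(suggest("recieve"))     # receive
print(suggest("definately"))  # definitely
```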

Many of these practices are the work of a new breed of scientist: the data scientist. Dundee is rolling out the first data science course in the UK from next year.

The power of the cloud for academic publishing

Mark Hahnel, founder of Figshare, explained that the falling cost of cloud-based solutions such as Amazon Web Services means that research data doesn't need to be lost to the annals of history. However, universities just aren't providing internal mechanisms for researchers to share and store their research data, so they are using storage and social media sites like GitHub, Flickr, SlideShare and Scribd. Hahnel's service Figshare allows researchers to publish all of their research outputs in a citable, sharable and discoverable manner. All file formats can be published, including videos and datasets that are often demoted to the supplemental materials section in current publishing models. Researchers use it to upload posters, dissertations, research notes and even failed grant applications. It is different from Dryad and other data stores in that it takes all data, not just datasets that are related to a published article. It also provides users with a DOI rather than just a static link.
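The citability claim rests on that DOI. As a quick illustration (the DOI below is a made-up placeholder on figshare's 10.6084 prefix; substitute a real one to run this), structured citation metadata for a DataCite-registered DOI can be fetched via content negotiation against doi.org:

```python
# Resolve a dataset DOI to structured citation metadata (CSL JSON) via
# content negotiation on doi.org. The DOI below is a placeholder, not a
# real record; replace it with a real DOI before running.
import requests

doi = "10.6084/m9.figshare.0000000"  # hypothetical figshare-style DOI

response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=10,
)
response.raise_for_status()
metadata = response.json()
print(metadata.get("title"), metadata.get("author"))
```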

From record to graph

Richard Wallis, Technology Evangelist, OCLC, UK, explained that libraries have been capturing data in some shape or other for a long time: from hand-written catalogue cards and Dewey classification to Machine-Readable Cataloguing (MARC) records. They have also been exchanging and publishing structured metadata about their resources for decades. Libraries were fast to move onto the web: online OPACs began emerging from 1994, and they were also keen to use URIs and identifiers. However, as they move into the web of data, does their history help? Libraries are like oil tankers set on a route: it's very hard for them to turn. A history of process, practice, training and established use cases can often hamper a community's approach to radically new ways of doing things. OCLC's WorldCat.org has been publishing linked data for bibliographic items (270+ million of them) since earlier this year. The core vocabulary comes from schema.org, a collection of schemas developed jointly by Google, Bing, Yahoo! and Yandex. The linked data is published both in human-readable form and in machine-readable RDFa, and WorldCat also offers an open data licence. The rest of the web trusts library data, so this really is a golden opportunity for libraries to move forward, and Wallis believes that lots of libraries and initiatives will publish linked bibliographic data in the very near future.
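To make the linked data concrete, here is a minimal sketch (with an invented OCLC number and title) using Python's rdflib library to describe a book in the schema.org vocabulary and serialise it as Turtle; WorldCat publishes equivalent statements embedded in its pages as RDFa.

```python
# Minimal sketch: describing a bibliographic item with schema.org terms
# using rdflib, then serialising to Turtle. The identifier and title
# are invented for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
book = URIRef("http://www.worldcat.org/oclc/0000000")  # placeholder ID

g = Graph()
g.bind("schema", SCHEMA)
g.add((book, RDF.type, SCHEMA.Book))
g.add((book, SCHEMA.name, Literal("An Example Catalogue Record")))
g.add((book, SCHEMA.author, Literal("A. N. Author")))
g.add((book, SCHEMA.datePublished, Literal("2012")))

print(g.serialize(format="turtle"))
```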

Research Data Management

A final mention goes to the Research Data Management (RDM) session, which comprised presentations by myself and Robin Rice, Data Librarian at the University of Edinburgh.

My talk looked at supporting libraries in leading the way in research data management. RDM initiatives at several UK universities have emanated from the library, and much of the DCC's institutional engagement work is being carried out in conjunction with library service teams. It has become apparent that librarians are beginning to carve out a new role for themselves using their highly relevant skill set: metadata knowledge, understanding of repositories, experience of the open sharing of publications, good relationships with researchers and good connections with other service departments.

I began by looking at work carried out in understanding the current situation, such as the 2012 RLUK report on Re-skilling for Research, which found nine areas where over 50% of respondents with subject librarian responsibilities indicated that they have limited or no skills or knowledge; in all cases these areas were also deemed to be of increasing importance in the future. Similar reports have indicated that librarians are overtaxed already, lack personal research experience and have little understanding of the complexity and scale of the issue. They need to gain knowledge and understanding of many areas, such as researchers' practice and data holdings, Research Councils' and funding bodies' requirements, disciplinary and/or institutional codes of practice and policies, and RDM tools and technologies. I then took a look at possible approaches for moving forward, such as the University of Helsinki Library's knotworking activity, the Opportunities for Data Exchange (ODE) project, which shares emerging best practice, and the Data Intelligence 4 Librarians training materials from the University of Delft. I also looked at UK-based activities such as RDMRose, a JISC-funded project led by the University of Sheffield iSchool to produce OER learning materials in RDM tailored for information professionals.

My overview was complemented by Robin's University of Edinburgh case study. Edinburgh had an aspirational RDM policy passed by Senate in May 2011. The policy, which was library-led and involved academic champions, has been the foundation for many other institutions' policies. Some of the key questions for the policy were: who will support your researchers' planning? Who has responsibility during the research process? Who has rights in the data? Rice explained that when developing policy you need to be aware of the drivers for your own institution, be able to practise the art of persuasion and have high-level support. Consider the idea of a postcard from the future: what would you like your policy to contain to enable a future vision? Rice gave an overview of training work carried out in Edinburgh, including the MANTRA modules and tailored support for data management plans using DMPonline. She also highlighted the University's DataShare, an online digital repository of multi-disciplinary research datasets produced at the University of Edinburgh, hosted by the Data Library in Information Services. In her conclusion she talked about the challenges for librarians moving into RDM: a lack of time for these activities, the need for new partnerships, establishing credibility in a new area of expertise, and getting their hands dirty with unpublished material.