
Software preservation and sustainability

Diana Sisu | 16 March 2017

Overview

Author: Jez Cope, University of Sheffield

I work as Research Data Manager at the University of Sheffield, but I have a background in computer science, so in my mind research data is inextricably linked with the software used to create, process, analyse and visualise it. As a result I will happily talk at length, to anyone willing to listen, about the importance of archiving, sharing and preserving this software, whether it was written by the researchers themselves, or is simply a proprietary package required to make sense of the data. Software is a vital part of the research process and, in the case of software specially written for research, an important output of that process in its own right.

One of the ways we are handling this at Sheffield is through a strong collaboration between the University Library and Research Software Engineering @ Sheffield, but given the range of international attendees at IDCC I wanted to see what colleagues at other institutions were doing in this area. I therefore suggested that we run a Birds of a Feather (BoF) session, to give people the opportunity to discuss this in more detail and learn from each other.

We had a variety of people attending the session, from research software engineers who grapple with software every day, to archive professionals with expertise in preservation of digital content.

Some interesting projects were mentioned. The British Library plans to maintain an archive of the software needed to access items in their own digital collections, while The Internet Archive Software Collection already has over 163,000 items in its vintage and historical software library, including many open source projects along with software and games from obsolete consoles and microcomputers.

To give a flavour of the discussion, I asked a couple of the attendees to contribute their own perspectives. Iain Emsley is a Research Software Engineer at the Oxford e-Research Centre and a member of Reproducible Research Oxford. Jenny Mitcham is a Digital Archivist at the University of York, based in the Borthwick Institute for Archives.

Research Software Engineer’s perspective

Iain Emsley, University of Oxford

I joined the BoF as someone who is interested in the idea of keeping software alive where possible and practical. I wanted to learn what archivists want, as well as to hear other thoughts on the subject.

I'm going to be a heretic here and suggest that not all software can or should be preserved in such a way that it can be run for all time. There are, however, basic practices that increase the chances of keeping software alive and useful: creating tests, writing documentation, linking to any papers that use the software, and recording any specifications for what the software is meant to do.

A social plan of attack might be to build a community around the software, where the community continues developing the project or forks it for new uses. Building a community is hard in itself, and keeping it going is equally difficult.

Yet keeping the software running is a major task in itself.

As the BBC discovered with the 1986 Domesday Project laser discs, technology can quickly become obsolete (http://www.bbc.co.uk/history/domesday/story). The CAMiLEON project (/resources/external/camileon-creative-archiving-michigan-and-leeds-emulating-old-new) used the discs as an exemplar of using emulation to revive the software on a more up-to-date machine. 25 years later, the Domesday Reloaded project brought the data back to life using the Web.

Ian Gent's The Recomputation Manifesto (https://arxiv.org/abs/1304.3674) argues that the only credible technique is to provide virtual machines that allow the original experiment to be recomputed. Gent raises an important point that we need to address: the provision of the environments and all parameters, including any recompilation of software or languages. We should write down the parameters used for experiments, and all changes made, so that experiments can be rerun or software recompiled in the same way.
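
As a minimal illustration of that last point, the sketch below records a run's parameters alongside some basic details of the environment in a JSON file. It is only a sketch: the file name and parameter values are my own illustrative choices, not taken from Gent's paper.

    # Sketch: capture the parameters and basic environment details of an
    # experimental run so it can be repeated later. Names and values here
    # are illustrative only.
    import json
    import platform
    import sys
    from datetime import datetime, timezone

    params = {"seed": 42, "iterations": 1000, "tolerance": 1e-6}  # hypothetical parameters

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "parameters": params,
        "python_version": sys.version,
        "platform": platform.platform(),
        "command_line": sys.argv,
    }

    # Store the record alongside the results so the run can be repeated
    # with exactly the same inputs.
    with open("run-record.json", "w") as fh:
        json.dump(record, fh, indent=2)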

Containers, especially Docker, are being used, as Rob Haines and Caroline Jay argue (https://www.software.ac.uk/blog/2016-09-12-reproducible-research-citing-your-execution-environment-using-docker-and-doi), to provide a recipe for an application, including package names and version numbers, allowing the software environment itself to be cited.
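
One practical step in that direction, sketched below, is to record an immutable identifier for the container image used for an analysis so it can be quoted alongside other citations. This is an assumption-laden illustration rather than the workflow described in the linked post: it assumes the Docker command-line client is installed and a Dockerfile is present, and the image name "my-analysis" is hypothetical.

    # Sketch: build a Docker image and record its content-addressed ID so
    # the execution environment can be referenced later. Assumes the Docker
    # CLI is available and a Dockerfile exists in the current directory;
    # the image name is hypothetical.
    import subprocess

    IMAGE = "my-analysis"

    # Build the image from the local Dockerfile.
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)

    # Ask Docker for the image's sha256 ID.
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.Id}}", IMAGE],
        capture_output=True, text=True, check=True,
    )
    image_id = result.stdout.strip()

    # Keep the ID with the project records so the exact environment used
    # for the analysis can be pointed to unambiguously.
    with open("environment-id.txt", "w") as fh:
        fh.write(image_id + "\n")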

A number of alternatives are now appearing, so I don't think there is a single way to keep software runnable after the project has ended. Emulation may be one way of achieving a build after the hardware has become obsolete; virtual machines or containers are others.

In The Significant Properties of Software: A Study (http://purl.org/net/epubs/work/65878), the authors comment: “Good software preservation arises from good software engineering” (p24). Although the concern of many engineers is the current state of the software, preservation can be supported throughout the development life-cycle: documentation (requirements, comments, change notes, installation instructions and so on), storing code in version control, testing, and re-using components all aid both development and preservation.

I would argue that proposals and project briefs should build in a mandatory section about how a project and its software and data artefacts are going to be archived. Whilst we all hope for the continuation of the project, there should be plans in place to bring the software and all its artefacts and documents together before they are needed. This may also mean considering redirects if URLs change or artefacts move, to maintain links where possible. Preservation should be a goal achieved through solid practice, not a side effect. It is an ongoing set of processes, and as engineers, developers and designers we need to be aware of this, communicate it, and listen to requirements.

Digital Archivist’s perspective

Jenny Mitcham, University of York

I joined the Software Preservation birds of a feather session because I’m aware that this is an area where my own knowledge is limited. I was keen to find out a bit more... or rather to shamelessly steal examples of good practice from others.

At the University of York we have guidance about Research Data Management (RDM) on our website (https://www.york.ac.uk/library/info-for/researchers/data/) and run training courses on managing research data. We have had an RDM policy for the last few years which is currently up for review. But none of these things specifically mention research software.

However, I know that research software is a thing... and a thing that we need to be concerned about. Several of our researchers have already deposited software/code with our Research Data York service, and very occasionally in our RDM teaching sessions questions about research software are raised and I’m never entirely sure how to answer them.

My question to the assembled group was simple: “What advice can we give to our researchers who are creating their own software as part of their research?”

The first and perhaps most useful piece of advice was to direct them to the resources on the Software Sustainability Institute (SSI) website (www.software.ac.uk). This website is full of information and advice as well as practical tips on how to make software more sustainable. There is a very useful section about Software Management Plans (https://www.software.ac.uk/software-management-plans) - I had no idea that there was SSI guidance within the DMPonline (https://dmponline.dcc.ac.uk) tool, so this was a really useful find.

Here is a summary of suggestions made around best practice advice we can give our researchers.

  • Plan – It was agreed that information about software and how it will be managed throughout the project and beyond should be included within the data management plan for the project... or indeed in a tailored software management plan.
  • Document – We always encourage our researchers to document their data, and this is equally applicable to code. Comments within the code are of course good practice, but it is also important to record which libraries it uses and to fully document what it does and how it does it (a short sketch of recording library versions appears after this list). This means that if the code is no longer working, someone would be able to write another piece of software that performs the same function.
  • Publish – GitHub was suggested as a good place to make software available. The platform is designed for this purpose and is ideal for handling versioning and providing access to a living, evolving thing such as software. There was some discussion about whether a copy of the software should also be deposited in the institutional repository alongside the research data. Practices differ here, but I would suggest it could be valuable to do so.
  • Identify and cite – Software should be identifiable and it should be cited alongside the research data within relevant publications.
  • Licence – Ensuring the software has a clear licence is important: people need to know what they can and can’t do with it. It was noted that software released without a licence is not very useful.
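
On the “Document” point above, a minimal sketch of one way to record the libraries a piece of code depends on is shown below. It is an illustration only: the dependency list is hypothetical and the output file name is my own choice.

    # Sketch: write out the names and installed versions of the libraries a
    # piece of research code uses, so the documentation stays useful even if
    # the code itself eventually stops running. The package list is hypothetical.
    from importlib import metadata

    DEPENDENCIES = ["numpy", "pandas", "matplotlib"]

    with open("DEPENDENCIES.txt", "w") as fh:
        for name in DEPENDENCIES:
            try:
                version = metadata.version(name)
            except metadata.PackageNotFoundError:
                version = "not installed"
            fh.write(f"{name}=={version}\n")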

Since attending this session I have given one RDM training session – which I went into prepared to talk a bit more about research software. However, when I asked the question “Is anyone creating software as part of their research?” no-one admitted to it, so I sadly wasn’t able to impart any advice on this subject. Maybe next time!

In the meantime though I will be looking at what actions we can take to ensure that software created as part of a research project is not forgotten.