DataStage is a flexible data storage system that provides controlled access, secure backup, and the ability to transfer selected files to a more permanent archiving facility. Designed for research groups, the system appears as a mapped drive on the end-user’s computer, with additional features such as repository submission and addition of metadata available via a web interface.

It is one of the two components of the DataFlow data management infrastructure, designed to allow researchers to work with, annotate, publish, and permanently store research data. The other is DataBank.


Provider

Oxford University Bodleian Libraries, as part of the wider DataFlow project.

Licensing and cost

The software is free to download and use. The source code is released under the MIT (Expat) license.

Development activity

Version 0.3.1 of DataStage was released in May 2012. While the DataFlow project has finished, the source code repository and mailing list show the code continues to be maintained.

Platform and interoperability

DataStage is written in Python and is designed to run on Ubuntu Linux 11.10 (Oneiric Ocelot). Virtual machine images are provided for VMware Fusion 4.x (Mac OS X) and VMware Player (Windows). Although it is intended to integrate with DataBank, the software offers an API so that it can package datasets for submission to any SWORD-2-compliant repository.
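As a rough illustration of what a SWORD-2 binary deposit involves, the sketch below builds the HTTP headers for posting a zipped package to a repository's collection endpoint. It is not DataStage's own code: the function name `build_deposit_headers` is hypothetical, the header names follow the SWORD 2.0 profile as commonly documented, and the `SimpleZip` packaging identifier is an assumption, since the packaging value DataStage actually sends is not stated here.

```python
from pathlib import Path

# Assumption: a generic SWORD-2 packaging identifier; DataStage may use another.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def build_deposit_headers(package_path, in_progress=False, packaging=SIMPLE_ZIP):
    """Return HTTP headers for a SWORD-2 style binary deposit (hypothetical helper)."""
    name = Path(package_path).name
    return {
        "Content-Type": "application/zip",
        # The repository uses this to name the deposited file.
        "Content-Disposition": f"attachment; filename={name}",
        # Tells the repository how to interpret the package contents.
        "Packaging": packaging,
        # "true" signals that more content will follow before the deposit is complete.
        "In-Progress": "true" if in_progress else "false",
    }
```

The resulting dictionary could be passed to any HTTP client together with the package bytes; authentication and the endpoint URL depend on the target repository.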

End-users can connect to DataStage through a web interface or as a mapped drive on Mac, Linux or Windows machines.

Functional notes

The software gives three levels of password-controlled access: a "private" area accessible only to the file owner and the group leader, a "shared" area giving the group read-only access, and a "collaborative" area giving the group read and write access. The administrator can invite outside collaborators into the group and specify their level of access. Users can also access and annotate the files through a web interface.
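The three access areas can be modelled roughly as follows. This is a minimal sketch, not DataStage's implementation: the names `Area`, `User` and `permissions` are hypothetical, and it assumes that the file owner keeps full access in every area and that the group leader's access to the private area is read-only, neither of which is confirmed above.

```python
from dataclasses import dataclass
from enum import Enum


class Area(Enum):
    PRIVATE = "private"
    SHARED = "shared"
    COLLABORATIVE = "collaborative"


@dataclass(frozen=True)
class User:
    name: str
    is_group_leader: bool = False
    is_group_member: bool = True


def permissions(user, owner, area):
    """Return the set of permissions ("read", "write") a user holds on a file."""
    if user.name == owner:
        # Assumption: the owner keeps full access in every area.
        return {"read", "write"}
    if area is Area.PRIVATE:
        # Only the owner and the group leader can reach the private area.
        # Assumption: the leader's access here is read-only.
        return {"read"} if user.is_group_leader else set()
    if not user.is_group_member:
        # People outside the group see nothing unless invited.
        return set()
    if area is Area.SHARED:
        return {"read"}
    # Collaborative area: group members can read and write.
    return {"read", "write"}
```

For example, `permissions(User("bob"), "alice", Area.SHARED)` yields read-only access, while the same call with `Area.PRIVATE` yields no access at all.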

DataStage can be deployed on a local server, or on an institutional or commercial cloud; users can also dynamically invoke additional cloud storage as required. Users can integrate the system into existing backup procedures. The repository interface also allows researchers to push selected files into a more permanent archive facility.

While users can add free-text metadata via the web interface, DataStage also automatically captures a number of general file attributes: date uploaded; file name; last modified; type; owner; location; and size.
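The kind of automatic capture described above can be approximated with the Python standard library. The sketch below is illustrative only: `capture_attributes` and its dictionary keys are hypothetical names, the owner is reported as a numeric user ID, and the upload date is supplied by the caller because it cannot be read from the file system.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def capture_attributes(path, uploaded_at=None):
    """Collect general file attributes similar to those listed above (hypothetical helper)."""
    p = Path(path).resolve()
    info = p.stat()
    # Guess the type from the file extension; fall back to a generic binary type.
    mime_type, _ = mimetypes.guess_type(p.name)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    # Assumption: if no upload time is given, treat "now" as the upload time.
    uploaded = uploaded_at or datetime.now(timezone.utc)
    return {
        "file_name": p.name,
        "location": str(p.parent),
        "size": info.st_size,  # bytes
        "type": mime_type or "application/octet-stream",
        "owner": info.st_uid,  # numeric user ID (0 on Windows)
        "last_modified": modified.isoformat(),
        "date_uploaded": uploaded.isoformat(),
    }
```

Free-text metadata entered through the web interface would sit alongside a record like this rather than replace it.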

Documentation and user support

Documentation is available in the form of an Information for Test Users page and the DataStage documentation wiki. The software has a developer mailing list and JIRA issue tracker. Installation instructions are included in a README file, which comes zipped with the installation package.

Video walkthroughs are available that describe how to set up a suitable server platform, how to download and set up the software, and how to interact with it from the desktop.


Usability

End-users interact with the system either as a mapped drive on their computer, implicitly integrating with their operating system’s current navigation structure, or through a web interface. Installation and configuration use a command-line interface.

Expertise required

Installation and configuration benefit greatly from system administration knowledge and familiarity with the Linux command line. The walkthrough videos should make it possible to get DataStage running without such expertise, but novice users may not be able to get the full functionality and customisability out of the system.

Standards compliance

DataStage automatically gathers metadata in RDF format. The system uses the BagIt specification when transferring files to a permanent archive, which must be SWORD-2 compliant.
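To show what packaging files according to BagIt looks like in practice, the sketch below builds a minimal bag: a `data/` payload directory, a `bagit.txt` declaration and a checksum manifest. It is an illustration under stated assumptions rather than DataStage's own packaging code: `make_bag` is a hypothetical helper, and both the BagIt version written to `bagit.txt` and the choice of SHA-256 are assumptions, since the version and algorithm DataStage uses are not stated here.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(files, bag_dir):
    """Copy files into a minimal BagIt-style bag and write its manifest (hypothetical helper)."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True)

    manifest_lines = []
    for src in map(Path, files):
        dest = data / src.name
        shutil.copy2(src, dest)
        # Each manifest line pairs a checksum with the payload path relative to the bag root.
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{src.name}")

    # Bag declaration; the version number is an assumption.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag
```

A bag built this way could then be zipped and submitted to the archive, which can recompute the checksums to confirm that nothing was altered in transit.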

Influence and take-up

DataStage is used at the Oxford Bodleian Libraries. It is not known whether it is used in production elsewhere, but it has been tested by:

  • UK Data Archive (in conjunction with EPrints);
  • University of Hertfordshire and the Centre for Digital Music, Queen Mary University of London (in conjunction with DSpace);
  • RoaDMaP project, University of Leeds;
  • YHMAN Shared Virtual Data Centre;
  • a pool of research group leaders in the University of Oxford who implemented ADMIRAL (a precursor to DataStage).
Last reviewed: 24 November 2014