Defining institutional data storage requirements

18 March, 2013

Institutions developing infrastructure in support of research data management are engaging with a whole range of issues, both cultural and technical. One that stands out as a clear priority is that of research data storage, both for “live” data, during the active phase of research, and post-project archiving.

On the 25th February, Jisc, Janet and the DCC hosted a workshop that brought together service providers from a variety of HE institutions with commercial suppliers of storage solutions in an effort to develop a better understanding of institutional requirements and the extent to which these can be met by current provisions. Where gaps exist, the aim was to identify what needs to be done to close them.

The first part of the day comprised presentations from five institutions at different stages of service provision, giving some context and detail for later breakout discussions with suppliers.

For me, the most positive aspect of the workshop presentations and discussions was just how much consensus there was across the spectrum. Even though, in some cases, the consensus was that we simply don’t have all the answers and must do further work, defining the issues is a vital step.

 

COSTS OF STORAGE

There was fairly broad consensus across a number of participants on the current costs to an institution of providing active storage. Generally speaking, the price of bits and bytes storage amounted to around £500 per terabyte, per year for a single copy held on active, spinning disc storage. In most cases this rose to £1000/TB for two disc copies in separate locations, providing the system with redundancy, with a further tape back-up. These figures cover hardware, running costs and human infrastructure; some of the institutions and suppliers believed that they could provide similar provision at somewhat reduced cost.
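As a rough illustration of how these figures scale, the sketch below multiplies volume by the number of disc copies. It assumes that cost grows linearly with both, and that the tape back-up is folded into the two-copy price; neither assumption was stated at the workshop, and the function name is purely illustrative.

```python
# Per-copy figure reported at the workshop: roughly GBP 500 per TB per year.
# Treating this as a flat, linear rate is an assumption made for illustration.
COST_PER_TB_PER_COPY_GBP = 500


def annual_active_storage_cost(terabytes: float, disc_copies: int = 1) -> float:
    """Estimate the yearly cost in GBP of holding `terabytes` on `disc_copies` disc copies."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    if disc_copies < 1:
        raise ValueError("at least one disc copy is required")
    return terabytes * disc_copies * COST_PER_TB_PER_COPY_GBP


if __name__ == "__main__":
    # A hypothetical 50TB research group, with and without a second copy.
    print(annual_active_storage_cost(50))
    print(annual_active_storage_cost(50, disc_copies=2))
```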

Turning to long-term archive storage, there was a good deal more uncertainty over the true costs of preserving data; the resource required for curation is poorly understood but expected to outstrip that required for holding the data many times over. With that in mind, discussions of costs assumed a low-curation model of preservation. Allowing for depreciation in the cost of storage media, figures of around £5000/TB were mooted to retain data in perpetuity. However, this rests on a number of assumptions and cannot be considered as robust as the figure for active data storage.
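The workshop did not spell out how a perpetuity figure might be derived, but one common way of reasoning about it is a geometric series: if the yearly cost of holding a terabyte falls by a fixed fraction each year, the total over an unlimited horizon converges to a finite sum. The sketch below shows that arithmetic. The first-year cost of £500 and the 10% annual decline are assumptions chosen purely for illustration, not figures agreed by participants.

```python
def cost_in_perpetuity(first_year_cost: float, annual_decline: float) -> float:
    """Total cost of holding data forever, if the yearly cost falls by
    `annual_decline` (a fraction between 0 and 1) every year.

    This is the sum of first_year_cost * (1 - annual_decline) ** n
    for n = 0, 1, 2, ..., which equals first_year_cost / annual_decline.
    """
    if not 0 < annual_decline <= 1:
        raise ValueError("annual_decline must be in the range (0, 1]")
    return first_year_cost / annual_decline


def cost_over_years(first_year_cost: float, annual_decline: float, years: int) -> float:
    """Total cost over a finite number of years under the same assumption."""
    if not 0 < annual_decline <= 1:
        raise ValueError("annual_decline must be in the range (0, 1]")
    ratio = 1 - annual_decline
    return first_year_cost * (1 - ratio ** years) / (1 - ratio)


if __name__ == "__main__":
    # Illustrative inputs only: GBP 500 in year one, costs falling 10% a year.
    print(cost_in_perpetuity(500, 0.10))
    print(cost_over_years(500, 0.10, 20))
```

Under these assumed inputs the infinite sum comes to about £5000/TB, but the result is very sensitive to the decline rate, which is one reason to treat any perpetuity figure with caution.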

 

DESIGNING THE STORAGE LANDSCAPE

When outlining the shape of ideal future provision, several of the presenters discussed the concept of tiered storage, in recognition of the spectrum of values and requirements of research data. The problem at the moment, said one, is that we cannot provide a rich enough storage landscape.

At present, expensive spinning-disc storage is used to hold everything from active research data to redundant datasets that haven’t been accessed in years. It was agreed that any tiered system would have to present a smooth interface to the researcher, but how this might be implemented is unclear. As a corollary, there may be a need for software systems that flag duplicated or inactive data that could be migrated to cheaper levels of storage.
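No particular tool was proposed at the workshop, but a very simple version of the "flag inactive data" idea might look like the sketch below, which walks a directory and reports files that have not been accessed for a given number of days. The function name and the 365-day threshold are hypothetical, and file access times are not always reliable (some file systems are mounted so that they are not updated), so this is a starting point rather than a working policy.

```python
import os
import time
from pathlib import Path


def find_inactive_files(root, max_idle_days=365, now=None):
    """Yield (path, idle_days) for files under `root` whose last access
    time is more than `max_idle_days` in the past."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # 86400 seconds in a day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                last_access = path.stat().st_atime
            except OSError:
                # Skip files that vanish or cannot be read during the scan.
                continue
            if last_access < cutoff:
                yield path, (now - last_access) / 86400


if __name__ == "__main__":
    # Scan the current directory; candidates for a cheaper storage tier.
    for path, idle_days in find_inactive_files("."):
        print(f"{path}: idle for about {idle_days:.0f} days")
```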

System design cannot be done without the input of academic staff who are needed to qualify data value and identify acceptable recovery times. In one case, this question had been put to academic staff and the results were somewhat surprising, with the majority indicating that two weeks was an acceptable recovery time for the main bulk of data.

For long-term retention, all institutions envisaged a hybrid system in which external data centres are used wherever appropriate, with the remaining data accommodated by institutional systems. One institution estimated the split of archived data to be in the region of 25% external to 75% institutional.

 

ENABLING COLLABORATION

A major problem that needs to be addressed when designing active data storage architecture is the difficulty of supporting the collaborative sharing of active data. At present most institutions achieve this through somewhat convoluted workflows for assigning access keys to external researchers. The difficulties of engaging with these systems are driving researchers towards a whole host of third-party solutions such as Microsoft Skydrive, Google Drive and Dropbox.

For some researchers these systems may be perfectly adequate for their needs but security, back-up and tracking issues make it attractive for institutions to bring the data back within their own jurisdiction; in some cases, there may be institutional policies in place that proscribe their use.

 

HOW MUCH SPACE DOES A RESEARCHER NEED?

Of course, when it comes to designing future storage provision, the fundamental question that needs to be addressed is ‘How much?’ Estimates from a diverse set of universities of the total amount of storage required, based on what is currently held by researchers, ranged from 300TB up to 3.5PB. These figures are likely to be quite inaccurate as the dispersal of data holdings (one institution’s researchers had an average of seven places where data was stored) means that even researchers themselves have difficulty quantifying their data volumes.

There was a general recognition that some kind of control is needed when initially offering managed storage to researchers to prevent the system being swamped by the wholesale migration of low-value data. Even less certain is the volume of data that will need to be retained for the long term and the most appropriate platform and access conditions for universities’ data collections.

 

OVERCOMING THE ‘PC WORLD EFFECT’

Convincing researchers to use managed storage at a cost of £800-1000 per TB per year can be difficult when any of them can purchase a terabyte hard drive for £60 and run their own ad hoc system. Many institutions directly addressing research data storage infrastructure are seeking to overcome this barrier by providing a certain amount of storage for free and then charging for any usage over and above that provision.

The amounts on offer varied from 100GB to 5TB per researcher, but in each case the expectation was that this quantity would exceed the requirements of the average user.
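The "free allowance, then charge" model is easy to express as arithmetic. The sketch below is a minimal illustration; the function name is hypothetical and the quota and rate in the example are simply picked from the ranges quoted above, not any one institution's actual tariff.

```python
def annual_charge(usage_tb: float, free_quota_tb: float, rate_per_tb_gbp: float) -> float:
    """Yearly charge in GBP: only usage above the free quota is billed."""
    if usage_tb < 0 or free_quota_tb < 0 or rate_per_tb_gbp < 0:
        raise ValueError("all arguments must be non-negative")
    chargeable_tb = max(0.0, usage_tb - free_quota_tb)
    return chargeable_tb * rate_per_tb_gbp


if __name__ == "__main__":
    # A researcher within a 5TB free allowance pays nothing...
    print(annual_charge(usage_tb=2, free_quota_tb=5, rate_per_tb_gbp=1000))
    # ...while one holding 7TB pays for the 2TB above the allowance.
    print(annual_charge(usage_tb=7, free_quota_tb=5, rate_per_tb_gbp=1000))
```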

 

QUANTIFYING GROWTH

Of course, estimates of current holdings aren’t sufficient to build an accurate projection of future requirements; there also needs to be some understanding of the likely scale of growth of data generation.

Again, there seemed to be a fair amount of agreement over the rate of growth, with several institutions quoting a figure of around 25% per year. For some, this was considered to be far too low; one presenter suggested that using Moore’s law to calculate the rate of growth might still underestimate the scale.
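To see how much the choice of growth rate matters, the sketch below compounds a starting volume at 25% per year and, for comparison, at a "Moore's law" rate. Reading Moore's law as a doubling every two years (about 41% per year) is an assumption made here for illustration; the presenter did not specify an interpretation. The 300TB starting point is the lower end of the estimates reported at the workshop.

```python
def project_storage(current_tb: float, annual_growth: float, years: int) -> float:
    """Compound `current_tb` at `annual_growth` (e.g. 0.25 for 25%) for `years` years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return current_tb * (1 + annual_growth) ** years


# Assumed interpretation of Moore's law: doubling every two years,
# which is equivalent to growth of sqrt(2) - 1 (about 41%) per year.
MOORE_ANNUAL_GROWTH = 2 ** 0.5 - 1


if __name__ == "__main__":
    for years in (1, 3, 5, 10):
        at_25 = project_storage(300, 0.25, years)
        at_moore = project_storage(300, MOORE_ANNUAL_GROWTH, years)
        print(f"after {years:2d} years: {at_25:8.0f} TB at 25%, {at_moore:8.0f} TB at Moore's-law rate")
```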

 

INTEGRATION OF CLOUD SERVICES INTO INSTITUTIONAL OFFERINGS

Most participants recognised the value of some kind of hybrid storage system, in which appropriate use of the cloud boosts the institution’s capabilities. There is certainly a need for flexible, temporary solutions enabling an institution to accommodate peaks in the use of active, managed storage.

In terms of quantities, there was evidence that sudden requests for storage in the region of 10TB are not unusual but are hard to accommodate. Commercial suppliers indicated that they could easily meet this demand and regularly do so for quantities up to around 160TB. Per-terabyte costs for cloud-based storage appeared to be competitive with the costs of in-house provision, with figures of around £860/TB/year including redundancy. For some institutions, using cloud storage is seen as a way of side-stepping the difficulties of providing an in-house repository; one presenter described the prospect of providing open access to parts of the institution’s systems as ‘scary’.

Although there is a clear use-case for integrating cloud storage into institutional systems, there was some considerable concern surrounding service costs and the financial liabilities of using commercial providers when future patterns of use are only hazily defined. Discussions in the latter half of the workshop revealed that some of the third-party providers are currently serving quite specific markets whose use-cases don’t fully map onto those of the HE sector. There will need to be work done on the part of providers to tailor their services for the market and on the part of institutions to more accurately define their needs.

One area of particular concern was that of egress charges, the cost attached to the download of data that is held in commercial cloud storage. Institutions indicated that they couldn’t sign up for cloud services without cost ceilings in place to limit liability, whereas commercial suppliers are understandably wary of offering unlimited data access. One supplier doesn’t charge for egress but does have a fair use policy in place that limits the amount that can be accessed to 5% of the total amount stored per month. Clearly this is appropriate for a standard archive or backup storage but could be of limited use as part of a cloud-based, open-access repository or active data store.
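The effect of a fair use cap like this is easy to work through. The sketch below takes the 5%-per-month figure quoted above and checks whether a planned volume of downloads fits within it; the function names and the example volumes are hypothetical, and a real supplier's policy may well be measured differently.

```python
def monthly_egress_allowance_tb(stored_tb: float, fair_use_fraction: float = 0.05) -> float:
    """Data that may be downloaded per month under a fair use cap,
    expressed as a fraction of the total amount stored."""
    if stored_tb < 0 or not 0 <= fair_use_fraction <= 1:
        raise ValueError("stored_tb must be >= 0 and fair_use_fraction in [0, 1]")
    return stored_tb * fair_use_fraction


def within_fair_use(stored_tb: float, requested_egress_tb: float,
                    fair_use_fraction: float = 0.05) -> bool:
    """True if the requested monthly egress fits inside the allowance."""
    return requested_egress_tb <= monthly_egress_allowance_tb(stored_tb, fair_use_fraction)


if __name__ == "__main__":
    # An archive of 100TB from which 4TB is restored in a month: within the cap.
    print(within_fair_use(stored_tb=100, requested_egress_tb=4))
    # The same 100TB used as an active store with 20TB of downloads: over the cap.
    print(within_fair_use(stored_tb=100, requested_egress_tb=20))
```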

Cost is not the only consideration; cloud services also need to meet institutional data governance requirements. There was consensus that the physical location of the data should be within the European Economic Area. Although policies aren’t harmonised across the zone, this was felt to be an acceptably small risk.

The integration of cloud services with local systems needs to be a seamless experience for the end user. Providing universal authentication and avoiding multiple logins was seen by many as vital for lowering the barriers to use of alternative storage areas.

 

KEY MESSAGES:

  • Physical costs and pricing models of data storage are understood but are a fraction of the true cost of preserving data in the long term. Curation costs are not well quantified.
  • Storage design is likely to favour tiered, hybrid models.
  • Systems for sharing live data with collaborators are required.
  • Most data curation must remain an institutional responsibility but storage and some preservation actions could be outsourced.
  • Authentication and access issues need to be addressed for cloud services.
  • Cost and use models for research data in the cloud need to be properly developed.