IDCC13: Cloud Services

Patrick McCann | 21 January 2013

Immediately before lunch on Wednesday Paul Walk of UKOLN chaired a session titled Cloud Services, in which Dirk von Suchodoletz of the University of Freiburg and David S. H. Rosenthal of the LOCKSS Program presented their papers.

Von Suchodoletz’s paper, co-authored with Klaus Rechert and Isgandar Valizada, also of the University of Freiburg, was titled “Towards Emulation-as-a-Service - Cloud Services for Versatile Digital Object Access”. Von Suchodoletz started his presentation (PDF) by describing the versatility of emulation as a strategy for accessing digital objects and original environments, before noting the complexity that is often involved in emulating a system. After acknowledging the good work done on emulation and in the provision of digital preservation services by the KEEP and PLANETS projects, he identified several key challenges, including catering for the wide range of devices users would wish to use to access emulated environments, the wide range of original environments to be supported and the need to integrate emulators with institutional frameworks. Emulation-as-a-Service (EaaS) was then proposed as a way to meet these challenges.

EaaS would allow users to access emulated environments running in well-defined host environments, removing the need to set up emulators on a range of platforms, and they would do so using client software built on existing, widely used technologies, though there would be a need to map modern inputs to those expected by the emulated system. This sort of arrangement could allow organisations to share effort and resources and to take advantage of economies of scale and of the long tail. It also allows for content control - licence-restricted objects would remain within an institution’s perimeter, and measures could be put in place to control and record access to objects. Von Suchodoletz also proposed a software archive of components to facilitate the reproduction of original environments. He then showed a few screenshots from an implementation developed as part of the bwFLA project, showing a system emulating several versions of the Mac OS on different architectures as well as a version of BeOS and Windows 3.11. In response to questions about whether today’s users have the relevant knowledge to make use of old systems like these, he noted that documentation would also be made available, going beyond vendors’ user guides to take account of any bugs or quirks not covered there.
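The input-mapping challenge can be illustrated with a small sketch. This is not bwFLA code - the event names and the mapping table are entirely hypothetical - but it shows the shape of the translation layer a client might need between modern touch-style input and the mouse and keyboard events an emulated guest such as Windows 3.11 expects:

```python
# Purely illustrative sketch (not bwFLA code): translating modern client
# input events into the legacy events an emulated guest system expects.

# Hypothetical modern client events mapped onto a hypothetical legacy
# event vocabulary that an emulator front-end might accept.
MODERN_TO_EMULATED = {
    "tap": "mouse_left_click",
    "long_press": "mouse_right_click",   # a common touch convention
    "two_finger_scroll": "mouse_wheel",
    "key_press": "keyboard_scancode",
}

def translate(event_type):
    """Map a modern input event type to its emulated counterpart.

    Returns None for events the original environment has no equivalent
    for (e.g. pinch-to-zoom on a Windows 3.11 guest).
    """
    return MODERN_TO_EMULATED.get(event_type)
```

Events with no counterpart in the original environment simply have no mapping, and the client must either drop them or synthesise an approximation.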

“Distributed Digital Preservation in the Cloud” was written by Rosenthal with Daniel L. Vargas, also of the LOCKSS Program. Rosenthal began the presentation (PDF) by saying that he’d run through the experiment described in the paper quickly, before going on to look at developments since it was conducted. The experiment was designed to determine whether the use of cloud storage could reduce the cost of digital preservation.

There’s a broad consensus across digital preservation cost research that ingest accounts for about 1/2 of the cost of preservation, storage 1/3 and dissemination 1/6. Storage hasn’t been thought to be a big problem. Kryder’s law describes the exponential growth in storage density over time - a straight line on a log-scale plot - much as Moore’s law does for processing power. The decreasing cost of storage implies that if you can afford to store data for a few years, you can store it for a much longer period at very little extra cost. Rosenthal suggested that this is unsustainable - what if the straight-line trend described by Kryder’s law is in fact the middle portion of an s-curve? He drew attention to a graph by David Anderson showing such curves for a range of storage technologies, and noted the doubling of disk prices caused by the 2011 floods in Thailand. He then described the problem being faced: with volumes growing at 60% per year, storage densities growing at 20% per year and IT budgets growing at 2% per year, if storage is currently 5% of your budget it will exceed 100% in 10 years’ time.
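The arithmetic behind that projection can be sketched directly. Treating the three rates as compounding annually, the storage share of the budget grows by a factor of 1.6 / (1.2 × 1.02) ≈ 1.31 each year. This simple annual model is an illustration, not Rosenthal’s own model - with these rounded rates the crossover lands at year 12 rather than year 10 - but the qualitative point, that the share compounds past the entire budget within roughly a decade, is the same:

```python
# Rough sketch of the budget projection (simple annual compounding;
# the rates are the rounded figures quoted in the talk).
volume_growth = 1.60   # data volume grows 60% per year
kryder_rate   = 1.20   # storage capacity per dollar grows 20% per year
budget_growth = 1.02   # IT budget grows 2% per year

share = 0.05           # storage is 5% of the budget today
for year in range(1, 20):
    share *= volume_growth / (kryder_rate * budget_growth)
    if share >= 1.0:
        print(f"storage exceeds the whole budget in year {year}")
        # prints: storage exceeds the whole budget in year 12
        break
```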

The experiment involved running a LOCKSS box in Amazon’s cloud to collect 3 months of cost data, scaling that up to match a median box on the Global LOCKSS Network (GLN) and projecting out to compute a 3-year cost of ownership. The conclusion was that, provided running costs were less than $280 per month, a local box is cheaper. Rosenthal described a prototype long-term cost model used in the experiment (the paper notes the shortcomings of the standard Discounted Cash Flow technique and its use of Net Present Value). Local costs were based on figures from Backblaze and on running three geographically separate replicas, to match Amazon S3. It emerged that using S3 is always more expensive than the local alternative.
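The shape of that cost-of-ownership comparison is a standard Net Present Value calculation over the 3-year horizon. A minimal sketch, assuming a 5% annual discount rate and entirely hypothetical hardware and running-cost figures - only the 3-year horizon and the $280/month break-even come from the paper:

```python
def npv(annual_rate, monthly_cashflows):
    """Net Present Value of a stream of monthly costs (paid at month start)."""
    r = annual_rate / 12.0
    return sum(c / (1 + r) ** m for m, c in enumerate(monthly_cashflows))

MONTHS = 36            # the paper's 3-year cost-of-ownership horizon
CAPEX = 2000.0         # hypothetical up-front cost of a local box
LOCAL_OPEX = 200.0     # hypothetical monthly running cost (break-even: $280)
CLOUD_MONTHLY = 350.0  # hypothetical monthly cloud storage bill

local = [CAPEX + LOCAL_OPEX] + [LOCAL_OPEX] * (MONTHS - 1)
cloud = [CLOUD_MONTHLY] * MONTHS

print(f"local: ${npv(0.05, local):,.0f}   cloud: ${npv(0.05, cloud):,.0f}")
```

Varying the monthly running cost around the $280 figure flips which option is cheaper, which is exactly the break-even the experiment reports.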

Rosenthal then looked at cloud storage pricing in more detail - in particular, why aren’t the prices of such services falling in line with Kryder’s law? Amazon’s service was priced very aggressively at launch, and that price has been held as costs have fallen: pricing is dictated by competition, not by costs. Further, switching is expensive - a competitor would have to be very much cheaper to make the cost of transfer (staff costs, bandwidth, service disruption) worthwhile. These services do provide value for money for most of their customers, whose data lifetimes are very much shorter than the life of a disk and who experience spikes in demand. The model is just not good value for digital preservation, where data lifetimes exceed the life of a disk and there don’t tend to be spikes in demand.

Since conducting the experiment, Amazon has introduced a new service called Glacier. It’s a low cost data storage service for infrequently accessed data for which retrieval times of several hours are acceptable. Access costs can quickly mount up though, and can be difficult to predict. An analysis of this service determined that it is still too expensive to use as a stand-alone solution, especially considering the latency and vendor lock-in, a particular issue for Glacier given the access costs. The use of Glacier alongside a local system may possibly provide a viable solution though.

Rosenthal ended with a slide looking at how much it would cost to keep everything in the cloud - the cost would exceed the Gross World Product by 2018.