Variety in RDM requirements, a curse or the spice of life?

A report from the ‘requirements’ breakout discussion at RDMF14, the Research Data Management Forum event held in York on 9-10 November 2015.

A Whyte | 18 November 2015

The RDMF14 meeting in York on 9-10 November had ‘research data (and) systems’ as its main focus. Those brackets signified the expectation that integration would be a common talking point, and it did turn out to be an omnipresent theme. There were three breakout sessions, with ‘integration’ the headline topic for one, ‘shared services’ another, and ‘requirements’ a third. This is a summary of the breakout session on requirements, and I’ve aimed below to distil the conversation around six issues:

Is ‘cathedral thinking’ hampering RDM?

In his welcoming comments DCC colleague Martin Donnelly drew attention to the RDMF meeting a year ago, where we attempted to list ‘systems’ or software platforms relevant to RDM. We came up with around 80, even excluding discipline-specific tools. Alongside that, as a backdrop for the breakouts, we had Arkivum CTO Mathew Addis’s panel presentation. This referenced software engineer Eric Raymond’s famous essay ‘The Cathedral and the Bazaar’ [1], which critiqued the then-prevalent software development model: top-down, expert-driven creation of a monolithic software suite (the Cathedral).

The more open approach, governed by co-development maxims like ‘given enough eyeballs, all bugs are shallow’ and ‘release early, release often’, tends to result in a more diverse choice of solutions (the Bazaar). Mathew cited the excellent ‘101 Innovations in Scholarly Communication’ digest of tools to ease the research workflow [2], claiming that ‘researchers don’t use cathedrals, they use bazaars’.

The point was to warn against ‘cathedral thinking’ in RDM. There was perhaps a suggestion that institutions tend towards that, in piecing together one-size-fits-all systems driven from the top down by funder policy. There was also, I think, a definite hint that the task of integrating the many available solutions is presently under-resourced, and should not fall on the shoulders of any single supplier.

The requirements breakout that followed was facilitated by Robert Darby and Jane Williams, and took up this cue. Discussion began around the issues of ‘whose requirements?’, the challenge of prioritising them, and dealing with generic vs disciplinary tensions. Most of the conversation was on recurring ‘itches’ that the (UK) RDM community does not yet have particularly settled responses to. These included the core metadata requirements, ensuring that compliance needs are fulfilled, providing secure storage for collaboration, and then getting data producers to help move the contents to an appropriately structured location inside or outside the institution. Towards the end the discussion turned to requirements-gathering approaches.

What about the Pareto principle?

The difficulty in prioritising requirements that come from engagement with researchers kicked off the ‘whose requirements’ discussion. Caroline Hargreaves’ experience of finding that University of Manchester research groups don’t view the research cycle in a sufficiently similar way seemed to resonate. Business analysis strategies of providing 80% of the solution by addressing the vital 20% of the problem (the ‘Pareto principle’) are very challenging to apply in this context without making some fairly hard-nosed choices. Caroline suggested several pragmatic strategies:

  • Prioritise the needs of the biggest projects, partly because their PIs already have to deal with data sharing and standardisation issues to deliver their project, and getting PI buy-in is one of the big RDM service challenges.
  • Concentrate on shared approaches that might open doors to grant income sources that are new to particular faculty or school groups, such as cyber-security and (for health-related research) the Information Governance Toolkit.

Considering the prospect of dealing with the myriad forms of ‘smaller science’ data types and needs in a more ad hoc way, Martin Wolf questioned whether this is really a problem, asking ‘is thinking of RDM as “one thing” holding us back?’. One answer was that the composition of ‘the RDM system’ is strongly influenced by where the RDM service sits, affecting the ‘flavour’ of what the system does. A quick show of hands gave a roughly 4:2:1 split between Library, Research & Enterprise, and IT Service.

Rather than dwell on whether any of the relevant stakeholders would be more susceptible to ‘cathedral thinking’, Torsten Reimer asserted that the more pressing issue is coordination between them. That ought to happen as a result of senior management saying ‘this is a university-wide thing’. So while at Imperial the coordination post happens to be in the research office, the point of that was to get to a joined-up project plan, and from there to joined-up practice, or as close to that as one can.

Is disciplinary metadata a red herring?

The tension between generic and domain-specific needs is probably most acute when determining what the ‘core’ metadata fields are. As an aside, this is something DCC has articulated a view on. Although we have not been as vocal about it as we might, it is set out in Alex Ball’s recommendations for the Research Data Discovery Service pilot [3].

The EPrints Recollect schema offers a broader menu of choices, and the DataCite and OpenAIRE schemas are a narrower starting point. But the fact that this is a still-evolving topic speaks volumes about the messiness of the problem. RDM services can be promoted as all-encompassing systems, and for the metadata discussion it was a moot point whether that signifies ‘cathedral’ thinking or holistic thinking; the key point is that everyone needs discipline-agnostic metadata. Getting there is the issue.

One participant remarked that academic champions in their institution’s steering group had pointed at failed efforts to establish common schemas in their own domains, but they expected that everything should gravitate towards some ‘basic’ fields. This led to the ‘red herring’ suggestion, countered by the point that some disciplines do have agreed schemas and they are very important to them, so the question remains ‘what do you take as a general approach to the specific?’. Five tips from the discussion were:

  1. Focus on the ‘must do’ things; everyone needs to know what data is there, where it is, and to make it discoverable. It is more important to try to establish exchange and sharing as a way of working across the institution than worry too much about disciplinary details.
  2. Tell researchers you can and will take their domain-specific metadata if they have it. Doing something is better than nothing, even if it is just adding it to the deposited collection as another file.
  3. Telling researchers what disciplinary standard to adopt is unlikely to work, especially where a norm is not established. Better to drop hints than to push too hard.
  4. It is good practice to check that the description does actually identify what data is being deposited, even if it is presented as a readme file rather than structured metadata.
  5. Make a start and get some data out – if researchers see their colleagues doing it and getting a benefit from it they will take it up.

What compliance reporting metrics are needed?

Robert Darby’s question about how the group were meeting needs to monitor institutional compliance requirements – ‘it’s easy, right?’ – brought some hollow laughs. If it’s sensible to target development at requirements that represent the biggest ‘itches’, this is one that might be described as sore or ticklish, depending on your disposition.

Torsten Reimer had remarked in his keynote that any institution slavishly trying to comply with EPSRC’s expectations to the letter is missing the point that institutions and the broader RDM community should be ahead of the funders, rather than simply responding to them. Torsten has recently written about the difficulties of OA reporting [4]. Nevertheless he was the first to admit that, however similar RDM and Open Access are in other respects, OA is technically easier to implement and the workflows and systems, including compliance reporting, are relatively more advanced.

There was a shared view that RDM compliance is not so clear-cut as OA. Whatever drives the institution’s RDM policy, the broad scope of ‘the requirements’ to implement it makes them difficult to squeeze into meaningful metrics. The discussion focused on Data Management Planning and data access statements. It seemed sensible to focus the requirements lens where support is newer, as there are fewer entrenched institutional practices to get in the way of developing standard metrics.

The discussion started at the publication end of the research cycle, and the requirement (from RCUK) to include data access statements in publications, to marry these up with supporting data. The need for automation here was probably the most sorely scratched itch of the whole discussion, with some participants having to spend considerable amounts of time manually checking each publication for a data access statement. There was frustration in the knowledge that the information is already out there, but not flowing between the various systems within institutions, and between them and the funders and publishers. Identifiers were one way forward, provided there is support from agencies like Crossref and ORCID to link them up.

RDM services, as someone remarked, need to know about publications before they are published, and much of the information held about them ‘out there’ is already ‘out here’ in the heads of PIs. So the group’s gaze turned back to DMPs. Data Management Plans join up institutional stakeholders, and there is the potential for each of these to use the DMP as a gateway to resources and source of planning information; Jane Williams gave the example of storage at Robert Gordon University.

Nobody disputed the importance of DMPs, and some present mentioned using DMPonline to support them. However straightforward it may be to report on the numbers of DMPs submitted, there was some scepticism that it makes sense to rely on them as a source of information on later-stage RDM activity. One institutional survey had found 60% of researchers saying they had never written a DMP, and most of those who had written one never looked at it subsequently. So the idea of the DMP as a working document may have appeal for some of us, but it is some way from realisation.

How to share securely at a reasonable cost?

The requirements and cost issues around ‘file and sync’ Dropbox-style applications were a key issue for some of those in the service development phase. The use cases here ranged from a space for the PhD student to share their work with their supervisor, to large research projects needing a space to share securely with external collaborators. Several of the available third-party cloud storage solutions were discussed, mostly from the compliance and cost perspectives.

In this case the compliance issues were around Data Protection and cybersecurity, with non-EU hosting seen as problematic for human-subjects data. There was interest in the potential of CASB (Cloud Access Security Broker) software to meet institutional requirements for encryption. One participant mentioned an options paper that may be shareable (if and when it is, we’ll update this post to link to it).

Funding body compliance was recognised as having a bearing on the cost model, given that post-project storage costs cannot be charged to grants unless met up-front. In theory that should not be an issue for ‘active’ storage, but in reality the lack of awareness among researchers of institutional solutions for both active and longer-term storage meant that many were paying unnecessarily. We talked about cloud providers’ business models, typically a per-head charge in the case of file share and sync providers. PowerFolder was one of the offerings mentioned here as worth considering for a low per-head charge and European hosting.

The key point that Kevin Ashley made here was to take a long-term view. Whichever supplier is offering the best deal currently is unlikely to be the same three years hence, and it is crucial to avoid being locked in to a supplier, whether by technology or by terms and conditions that penalise migration. A useful starting point here is Jisc's framework agreements.

How can we better identify the right 'itches to scratch'?

The point was made that it is better to base solution choices on a good understanding of the requirements than on what suppliers are offering. This led naturally to the final theme of the discussion: how those requirements (or ‘itches to scratch’) are being identified.

Most of those present, it seemed, had not employed ‘user experience’ or UX approaches such as use case scenarios and personas in developing their RDM systems. Some mentioned surveying researchers, and others using interviews and meetings to try to bring researchers in through the project process, providing opportunities to comment on specs and sign off the project results. Generally it seemed that services had inferred what tools or support activity would be needed from high-level policy compliance requirements, and borrowed from the approaches of more mature RDM services. Leeds University’s functional requirements for a data repository were cited as an example.

Another point was that RDM service roles are often sandwiched together from various people’s other jobs, and they do not necessarily have the collective skills or resourcing to carry out extensive co-design of their RDM solution. Researchers, it was remarked, tend to be too specific in their needs for those to quickly be translated into a generic ‘solution’. Coupled with that, the opportunity to design based on user experience may also be limited by researcher scepticism that their data can have any reuse value without attention to their specific requirements.

The discussion ended with three points that seemed pretty useful to me:

  1. Being open about what we do is the best way to deal with the issue that most people don’t have the time to go out and speak to everyone.
  2. There’s a need for useful information on tools and services, including providers of development support with open source tools.
  3. Instead of engaging with users based on their organisational groupings it may be better to group them by the tools they use – perhaps echoing the ‘more user groups’ plea made by John Beaman earlier in the day.

DCC can support each of these to some extent. We have a How-to guide on discovering requirements [5]. We also plan to share examples of requirements approaches, and update our Tools and Services catalogue [6]. No doubt we should also consider support for user groups in our planning.

Getting back to the headline question, and to finish with some personal conclusions: whether ‘too much variety’ in RDM solution design is spice or chaos depends on your perspective. Most of us in the requirements breakout were involved in developing solutions, or in brokering knowledge of problems and solutions, rather than being researchers or other users.

Asking whether we do so from a ‘cathedral’ or ‘bazaar’ development perspective seemed to me a useful framework to start the discussion. The ‘bazaar’ approach was proposed in the context of open source software design, where a variety of modifications may evolve from a minimal prototype to a preferred solution. But only if the conditions are right, and the relative size of the development and user communities is recognised to be one of the most important of those conditions. The RDM development community within institutions is a small hard-pressed one, scrabbling around for crumbs to try to serve a very large user community with ridiculously limited resources. Perhaps that discourages co-design. It would be deeply unfortunate if it led to institutions buying into ‘glass cathedrals’ that seem to ‘do transparency’ in a clear way, but won’t allow the modification thing.

[1] Wikipedia (n.d.) The Cathedral and the Bazaar, available at: https://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar

[2] Kramer, B., and Bosman, J. (2015): 101 Innovations in Scholarly Communication - the Changing Research Workflow. figshare. http://dx.doi.org/10.6084/m9.figshare.1286826

[3] Ball, A. (2014) UK Research Data Registry Mapping Schemes, available at: /projects/research-data-registry-pilot

[4] Reimer, T. (2015, Oct.30) ‘Why OA reporting is difficult’, Blog post at: https://wwwf.imperial.ac.uk/blog/openaccess/2015/10/30/imperial-college-...

[5] Whyte, A and Allard, S. (Eds) 2014. ‘How to Discover Research Data Management Service Requirements’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: /resources/how-guides

[6] DCC Tools and Services Catalogue. Available at: /resources/external/tools-services

 

Photo credit. Romana Klee on flickr CC-BY-SA 2.0