Data curation in medical science and healthcare

Diana Sisu | 18 January 2019

1. Can you ever collect too much data to support medical research or healthcare? We are now collecting data from various devices e.g. fitness watches. Is it useful?

Actually there's very good evidence that too much data can be distracting and lead to poor judgment. So we want to be judicious about collecting data, though the real issue for us is understanding the decision points in the data lifecycle where informed choices must be made. We’re looking at what to collect, what to preserve, what to include in different kinds of computations, what to add into different future studies, and what will work for patients at all decision points.

Our interest is in recognizing that data come, and will continue to come, from a lot of sources. Clinical observations of blood pressure, fitness trackers, genome scans, and your dental x-rays are all different types of data about your body. There's also data around exposures, or what we call the exposome. We are subjected to these exposures throughout our lives, including diet and lifestyle. In addition, your family genetic structure has an influence. And there’s emerging evidence that people who live in the same household begin to show similar genetic patterns.

Our goal is to understand the nature of data at its source. In addition, we need to recognize that we may collect data for different purposes; data collected for one purpose, such as for care, may or may not be fit for other purposes. This poses some key challenges: on one hand, we may need to optimize or select data for certain purposes in some circumstances and, in other circumstances, figure out how to make data collected for one purpose useful for a new purpose.

Let me give you an example: People who have hypertension usually have their blood pressure monitored at their doctor’s office once a year or every few months. We know that blood pressure measurements vary throughout the day, so the best way to monitor blood pressure is to measure it throughout the day. However, if you've been stable on an antihypertensive medication for a while and your blood pressure is stable, a once-a-year assessment in a clinic will still provide useful clinical data. If you want to do post-market surveillance of a particular antihypertensive, or study the blood pressure effects of a chemotherapy drug, you would require a more in-depth, more precise approach.

There will always be a need for patient monitoring that gives us good clinical insights and there'll always be a need for purpose-collected data that gives us good research insights. And then, in between, we'll have a mix of different data resources. Not every data element should be saved, and judgments need to be made about how to make use of data collected in the course of clinical care or research.

2. Big data can help medical research and healthcare and NLM/NIH are champions of big data management. I'd like to hear which are your favourite projects and why.

One of our key projects right now is ClinicalTrials.gov, which is an international registry of clinical trials and a results reporting enterprise. It allows anyone, anywhere in the world who's conducting a clinical trial to declare or register that trial according to certain structures, giving us information about the number of people, what the intervention is like, and statistics. We can now collect the results of the primary outcomes, so our resource is a one-stop resource for studies of clinical trials. That means patients looking for new trials, and scientists who want to know if a particular drug has been studied, can come to our resource. We're also building new libraries around data.

A second really big project that I'm personally interested in is fostering NIH's strategic plan for data management. This is going to give us both a platform for storing data and the tools for exploring datasets wherever they're stored in the world. An important part of this is something we call AIM, authentication and identity management. It’s important that people are who they say they are and that they have a right to look at a dataset. The National Library of Medicine, through the National Center for Biotechnology Information, is leading NIH in new ways to do this. In the past, we used identity-based or role-based identity management; now we're developing mechanisms to use activity-based identity management that will be much more secure, provide more rapid responses, and reduce the burden on people who want to explore and integrate various datasets, all while we preserve patient privacy.

3. Browsing the NLM site, it was obvious how it is of use to medical researchers or carers. Can it be directly useful to patients and, if so, how?

The National Library of Medicine is strongly committed to citizen science. By citizen science, we mean fostering the use of our resources by any professional or lay person who has a curiosity or an interest in reviewing and accessing those data resources that are open to all. Recognizing that laypeople don't have the scientific training that clinicians do, we offer tools on our website that can help people learn how to use our datasets and how to understand what data are available. Also, anyone in the world can look at PubMed to find specialists who potentially could be helpful to them.

Everything we do is open because we’re a government library. We make tools that help people learn to use our resources or find people who are experts by making consumer-facing entry portals to our resources. We also tailor tools and information to people's educational level.

One of our important resources for the public has specific, patient-level information access points in the website. We also have MedlinePlus for health professionals and patients, which is like a medical encyclopedia offering trusted health and medical information in English and Spanish.

4. How good do you think we are at sharing medical data internationally? As a lay person I can't get a feel for it. Problems I've noticed: a huge amount of duplication for data that can be made easily available (same data stored in tens of databases), and rules on patient confidentiality and commercial interests making some data difficult to access and locked in precious ivory towers (one or two databases accessible to few people).

Sharing human data internationally is a challenge because countries have different laws and protection rights, and an individual holder of the data can exert significant control over access to, and sharing of, information. We collaborate with international unions and are part of the International Nucleotide Sequence Database Collaboration, as one of the three key sites that make sure genomic data is available worldwide, which is almost always for research.

Sharing clinical data isn’t easy for several reasons, including differences in confidentiality, differences in curation, and differences in the way we label and define health problems. I envision that the main challenge over the next 10 years will be the interoperability of clinical data.

5. In the abstract for your IDCC talk you refer to de-biasing data assets. What do you mean?

The National Library of Medicine is focusing on strategies to debias existing data assets. To explain further, debiasing data means recognizing biases that exist in certain datasets (e.g. a genomic dataset that includes only male samples is biased for sex). We must determine how far results from a dataset built on only one population can be trusted when extended to another population. Another example requires that we recognize the ethnocentricity of a dataset. Today many datasets include samples drawn only from people of European descent. So, you have to ask: what other populations are needed? We need to find ways to glean what we can from those datasets without making them the index or standard, recognizing the diversity of the human population.

We need to take our own datasets that were developed on a homogeneous sample and, when we can, make the results of exploring those datasets generalizable to heterogeneous populations. We're not going to be able to do it with everything, yet we shouldn't just ignore 25 years of data and start over again.

We're also interested in recognizing the limits of datasets: understanding how narrow or homogeneous a dataset is can help us know what we can and can't apply to the interpretation of that data. We're going to learn where we should be investing in the future to complement what we know in the present.

6. What is your most ardent desire, your key goal in data management? It would be nice to know, even if it is just a dream at this stage.

To make it cheaper.

The cost of data acquisition, curation and reuse is beginning to impinge on our ability to do primary research. We need to make very careful decisions about what we invest in.

At the National Library of Medicine, we’re engaged in a study that applies forecasting models to estimate the long-term cost of data sustainability and to help make more appropriate decisions based on the clinical and research value of the data. As you can see, we have exciting times ahead.

Related Links

Read about Patricia's talk at IDCC 2019

IDCC 2019 Programme