Sharing Data in Biomedical and Clinical Research
"I firmly believe that openness and transparency is in the best interests of science. And it's in the best interest of scientific careers as well." -- Andrew Vickers
There has been vigorous discussion in the scientific literature about the need for and value of sharing full data sets from biomedical and clinical research, but it's rare to see the issue get headlines in the mainstream media. In August, an article in The New York Times put the spotlight on a $60 million clinical study of Alzheimer's disease because of its innovative approach to data management: Clinical and imaging data collected in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) were made available immediately for scientists to download and analyze.
The data sets have been downloaded thousands of times, 160 papers using the data have been published so far, and 80 more are in the pipeline, Michael Weiner, principal investigator of ADNI, says in an interview with Science Careers. Making data transparent and available "so that other people can analyze the data and discover different things, [is] going to accelerate all of science," he says. "It's a relatively inexpensive way to get more value out of all of the work that we do."
However, ADNI's open clinical data-sharing policy is exceptional. "There has been a culture in biomedicine of not sharing data," says Andrew Vickers, associate attending research methodologist at Memorial Sloan-Kettering Cancer Center in New York City. "I think that culture has to change. And it's going to take young investigators to change it. I firmly believe that openness and transparency is in the best interests of science. And it's in the best interest of scientific careers as well."
Some fields already have standard data-sharing practices, but not biomedicine. Guidance is particularly lacking when it comes to sharing data from clinical trials and pooled from electronic health records. This article presents expert advice, suggestions, and resources aimed at answering key questions about sharing clinical and biomedical data.
When designing your study, you should discuss these issues with your mentor and your institutional review board, and seek out your institution's and funding agency's specific rules and regulations.
Several funding agencies have policies that support data sharing and encourage investigators to make their data available. Some journals state that sharing data from studies is required. However, these policies usually don't have a penalty for not complying, "so in some sense they're voluntary," notes Heather Piwowar, who studies data sharing as a postdoc funded by the DataONE cyberinfrastructure project.
The U.S. National Institutes of Health (NIH) makes a broad statement of support about sharing data in its grants policy statement: "NIH endorses the sharing of final research data to serve these and other important scientific goals and expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers."
Investigators who apply for NIH grants of $500,000 or more must include a data-sharing plan with their grant application. These plans should indicate how data will be shared or explain why it cannot be shared. Grant review panels don't consider the data-sharing plan when evaluating an application, but once a grant has been funded investigators are expected to keep their data-sharing promises. "Data-sharing plans that are accepted become a term and condition of the award. The researchers can be held to their data-sharing plan," says J. P. Kim, director of the Division of Extramural Inventions and Technology Resources within the NIH Office of Extramural Research in Bethesda, Maryland.
For genetic association studies, the NIH requirements are stronger: NIH-funded investigators conducting "genome-wide analysis of genetic variation in a study population are expected to submit to the NIH genome-wide association studies (GWAS) data repository descriptive information about their studies for inclusion in an open access portion of the NIH GWAS data repository," the policy states. A frequently asked questions document about the policy says this also includes NIH-funded clinical trials that have a genetic association component. The NIH repository for GWAS data is dbGaP.
Journal policies on data sharing vary, but they typically urge authors to deposit specific types of data in the relevant repository. Here are some excerpts from Science's instructions for authors:
"Appropriate data sets (including microarray data, protein or DNA sequences, atomic coordinates or electron microscopy maps for macromolecular structures, and climate data) must be deposited in an approved database, and an accession number or a specific access address must be included in the published paper. We encourage compliance with MIBBI guidelines (Minimum Information for Biological and Biomedical Investigations). ...
Large data sets with no appropriate approved repository must be housed as supporting online material at Science, or only when this is not possible, on an archived institutional Web site, provided a copy of the data is held in escrow at Science to ensure availability to readers."
Another example from Cancer Research: "Authors of manuscripts with new nucleotide or amino acid sequences must deposit the sequence information with GenBank. ... Authors must submit the relevant accession numbers for deposited sequences with the manuscript and these will be published with the article."
Before you begin a study, check with your funding agency, institution, and target journals about their policies on sharing data -- and any possible restrictions on doing so.
Sharing data increases the transparency of the scientific process, says Weiner, who is director of the Center for Imaging of Neurodegenerative Diseases at the Veterans Affairs Medical Center in San Francisco, California. "Most data is collected by investigators. They write papers, they post papers, but the raw data and the data trail that leads to the papers is invisible." Access to raw data sets brings higher visibility to that data trail, and it allows the opportunity for scientific results to be independently tested and verified.
Weiner adds that the open-data policy in the ADNI study has meant that the data have been subjected to far more analyses than they would have been if only a small collaboration had been allowed to access them. "My colleagues and I are so busy [administrating the project] that sometimes we just don't have time to write the papers we think ought to be written, and other people are doing that," he says. "It's wonderful to see the data get analyzed."
Others note that sharing data may also improve the integrity of research. "A robust regime of data sharing would make scientific misconduct a lot harder," says James Miller, an attorney and visiting scholar in the Department of Health Policy and Management at Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland.
Investigators who share their data have the satisfaction of contributing to those broad scientific advantages, but it can be difficult to see the advantages to them individually.
"It's probably the biggest question asked: 'What's in it for me?' " says Nicholas Anderson, assistant professor of biomedical health informatics at the University of Washington, Seattle. "What's often in it for them is collaborations, funding, and being more visible in the community by being more available."
For example, if you share or are willing to share a particular data set, a researcher who wants to study that data may invite you to collaborate on an analysis you wouldn't have pursued yourself. "I think we're seeing a lot more new investigators forming collaborations perhaps earlier than their senior peers because they have to," Anderson says. "They don't have the experience in informatics or regulatory or ethics or statistics, so they form affiliations and they really bootstrap things."
Sharing data may also increase how often your work is cited, particularly as standards for citing data take shape. Piwowar conducted a study that found that journal articles presenting cancer microarray clinical trials for which the investigators had made their data publicly available were cited about 70% more frequently than those from investigators who did not share their data. "There is evidence of citation benefit in some subdisciplines," Piwowar says. "I think that citation benefit will go up as we standardize on ways to cite data sets and as treating data sets as first-class entities becomes the norm."
You should discuss your plans for sharing your data with your mentor and your institutional review board to address informed consent, patient privacy, and IRB oversight for your study. In addition, NIH maintains a collection of links and resources for extramural researchers on its Research Involving Human Subjects Web page. NIH addresses patient-protection issues in the online booklet Protecting Personal Health Information in Research. Below is some general information on the topic.
The Health Insurance Portability and Accountability Act (HIPAA) is designed to protect patients' personal health information. A patient may give informed consent for his or her health information to be used in a particular clinical study, but that applies only to the research question outlined in the informed consent document. That patient's clinical data cannot be used in the context of another study if the patient's identifying information (such as name, hospital record, or date of birth) remains linked to the patient's clinical data.
This puts limits on sharing data that contain protected health information. "Even in cases where HIPAA would not prevent the sharing of data, finding out whether or not it does is time consuming and complicated, so there's a tendency for some researchers to say, 'If there's any possibility that I could run afoul of HIPAA, I'm simply not going to share my data,' " Miller says.
However, once identifying information has been removed from data, the data are no longer subject to the rules of the privacy act, nor are they restricted by the terms of the original informed consent. "According to federal rule, de-identified data is not subject to IRB overview," Vickers says. "IRBs are there to protect patients, and this is not a patient-protection issue. I do advise people to speak to their IRBs just to confirm."
There are two ways to de-identify data, according to the parameters of HIPAA. First, you can remove 18 specific identifiers from the data record, which include things such as name; a geographic location smaller than a state; all dates related to the individual such as birth date, admission date, or date of death; social security number; and medical record number.
Or, if it's not possible to remove all identifiers, researchers can use statistical methods to mask the identifiers. "If for some reason you need date of birth, you can add jitter to it ... so it's still statistically valid," Vickers says. "If there's a date that's critical, maybe the date of surgery, you add a little bit of random noise to it." In that case, a qualified statistician must review the data and certify that the risk of identifying individual patients in it is very small.
"To de-identify 99% of data sets takes 5 minutes," Vickers says.
Vickers and colleagues provide further guidance on de-identification of data in their guidelines for preparing raw clinical data for publication.
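The two approaches described above, stripping direct identifiers and adding random noise to dates that must be kept, can be sketched in a few lines of Python. This is an illustration only, not a HIPAA-compliant procedure: the column names and the `drop_identifiers` and `jitter_dates` helpers are hypothetical, the identifier set is a small subset of the 18 identifiers, and shifting all of one patient's dates by the same offset (so that intervals between them are preserved) is an assumption about how one might apply jitter, not a method taken from Vickers's guidelines. A qualified statistician would still need to review any data set de-identified this way.

```python
import random
from datetime import date, timedelta

# Hypothetical column names for direct identifiers. A real data set must be
# checked against the full HIPAA list of 18 identifiers, not just these.
DIRECT_IDENTIFIERS = {"name", "ssn", "medical_record_number", "zip_code"}


def drop_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed identifier fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


def jitter_dates(record, date_fields, max_days, rng):
    """Shift every listed date in one record by the same random offset.

    One offset per patient (an assumption of this sketch) keeps the
    intervals between that patient's dates unchanged.
    """
    offset = timedelta(days=rng.randint(-max_days, max_days))
    shifted = dict(record)
    for field in date_fields:
        shifted[field] = record[field] + offset
    return shifted


# Made-up example record.
rng = random.Random(0)  # fixed seed only so the example is repeatable
patient = {
    "name": "Jane Doe",
    "medical_record_number": "12345",
    "surgery_date": date(2009, 3, 14),
    "followup_date": date(2009, 9, 14),
    "psa_ng_ml": 4.2,
}
clean = jitter_dates(
    drop_identifiers(patient), ["surgery_date", "followup_date"], 30, rng
)
```

In this sketch the time between surgery and follow-up is the same in `clean` as in `patient`, while the calendar dates themselves no longer match the medical record.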
You should think about how you will manage and ultimately share your data from the earliest stages of designing your study. "Prospective design is critical," Anderson says. "I don't know how I can stress enough that [you need] early understanding of the knowledge structure and management of a clinical trial or a research experiment that has some alignment with both existing data, ownership of the data, and expectations for both analysis and sharing of it."
The NIH Web site sharing.nih.gov includes a sample data-sharing plan and key elements to consider for data sharing, both of which contain useful points to consider even if you're not applying for NIH funding. Some sample questions from the key elements document:
-What types of data are to be collected in the study and shared (such as genetic, physiological, or clinical)?
-What data documentation will be shared so that others can understand and use the dataset without misuse, misinterpretation, or confusion?
-Will a new repository need to be developed, and if so, who will maintain the repository?
-Will the data be distributed directly by an investigator to those who request it (e.g., through an electronic file)?
-What steps will be taken to help researchers know that the data sets exist?
These questions give some indication of the sorts of issues you should be grappling with when you start to design your study. The U.K.-based Wellcome Trust maintains similar documents for its grant applicants, a guidance on preparing data-sharing plans and a Q&A on data sharing, which may provide additional points to ponder when developing your own data-sharing plan.
Consult your own institution and funding agency about their specific data-sharing requirements. Or consult the BioSharing Web site, which maintains a list of several funding agencies' data-sharing policies.
As you think about how you will collect and manage your data, consider what reporting standards apply to your type of study and your specific field. Many subfields don't yet have such standards; this has long been a problem in clinical and biomedical research, and researchers in many subdisciplines are working to develop them.
Some organizations are developing global data standards, such as those from the Clinical Data Interchange Standards Consortium. Data annotation standards have also been developed for fields including autism research, neuroscience, and cancer research. There are reporting standards for specific scientific techniques, such as the MIAME guidelines for microarray data. (A list of more standards is available from the BioSharing Web site.) Ensuring that your data conform to established standards will help ensure the utility of your data set to other researchers.
Vickers and colleagues have published guidelines for preparing raw clinical data for publication, which offer suggestions for nearly every step in the path, from data collection to publication, with an eye toward sharing the data with other researchers. "We realized that science is very, very heterogeneous and it's impossible to sit in a room and predict all the sorts of data types you could have," Vickers says. "What we said is that people should provide data and code and that the data and code should be sufficiently well annotated that a competent statistician could replicate the main results in the paper."
Researchers recognize that it's almost impossible to standardize certain types of data. But even if that's true of your data, you should make sure your data are available in a format that's useful to other investigators. "If a researcher makes the patient-level data available in a PDF format, those data are basically worthless," Miller says. "You have to make data available in a data set that people can download into their statistical package of choice."
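As a rough illustration of what a downloadable, annotated data set might look like, the sketch below writes patient-level rows to a CSV file, a plain-text format that most statistical packages can import, alongside a small codebook describing each variable. The variable names, values, and the `write_dataset` and `write_codebook` helpers are invented for this example; a real study would follow whatever reporting standards apply to its field.

```python
import csv
import io


def write_dataset(rows, fieldnames, out):
    """Write patient-level rows as CSV with a header line."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def write_codebook(codebook, out):
    """Write one line per variable: its name, description, and units."""
    writer = csv.writer(out)
    writer.writerow(["variable", "description", "units"])
    for variable, (description, units) in codebook.items():
        writer.writerow([variable, description, units])


# Made-up, already de-identified example data.
fieldnames = ["patient_id", "age_years", "psa_ng_ml"]
rows = [
    {"patient_id": 1, "age_years": 61, "psa_ng_ml": 4.2},
    {"patient_id": 2, "age_years": 58, "psa_ng_ml": 6.8},
]
codebook = {
    "patient_id": ("Study-assigned identifier, not linked to records", ""),
    "age_years": ("Age at enrollment", "years"),
    "psa_ng_ml": ("Prostate-specific antigen at baseline", "ng/mL"),
}

# In-memory buffers keep the example self-contained; in practice these
# would be files opened with open(path, "w", newline="").
data_csv = io.StringIO()
codebook_csv = io.StringIO()
write_dataset(rows, fieldnames, data_csv)
write_codebook(codebook, codebook_csv)
```

Keeping the codebook as a separate machine-readable file is one design choice among several; the point is that the variable definitions travel with the data rather than living only in the paper.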
Finally, you should consider how and where to post your data. Repositories exist for certain kinds of data, such as Proteome Commons for proteomics data and dbGaP for GWAS data. But there is no single repository for clinical and biomedical data -- which is as it should be, several experts interviewed for this article say. Dryad is a repository for data sets for peer-reviewed, published articles in basic and applied biosciences, and Sage Commons is for integrative genomics and disease modeling. Several interviewees noted the Dataverse Network Project, which can serve as a mechanism for managing data and sharing it, either by uploading data to the IQSS Dataverse Network or by downloading the Dataverse software and creating your own repository.
Small data sets can be published as supplements with the corresponding journal article. But many data sets are too large to post as supplementary data, and others still contain sensitive information about patients and so cannot be posted publicly. In addition, data supplements may not be a durable solution for sharing data: In a 2006 study, Anderson and colleagues looked at online data supplements accompanying a subset of articles indexed in PubMed and found that 17% to 29% were no longer available -- some as soon as 1 year after publication.
A lot of common data-sharing methods, such as putting data on a university department's Web server, have proven to be unsustainable. "I've seen so many grants that say, 'We're going to make it available on the faculty Web server,' " Anderson says. "You know that that probably won't remain true for long -- not for any nefarious reason, just because someone has to do it, that person may quit, or something might change."
That's why Kim and others recommend repositories for sharing data. "The best way to share is to put it in an appropriate repository because that way the data is automatically taken care of," says Kim. "The data would only be shared appropriately. It also alleviates the burden on the PI from having to fulfill data requests continuously."
Researchers need a broad understanding of informatics to deal with their study data. "I would recommend that [investigators] familiarize themselves with how information is beginning to be shared and structured, from discovery systems to outcome-data capture to common surveys to HIPAA, and what the constraints are, as early as possible, such that they can be more strategic about it," Anderson says.
But when you need specialized knowledge, you should reach out to experts and collaborate with them. "It's hard to do any of these trials on your own," Anderson says. "You're being reviewed by interdisciplinary teams, and you're competing with interdisciplinary teams. So you have to form interdisciplinary teams."
Kate Travis is the editor of CTSciNet, the Clinical and Translational Science Network, an online portal for career development in clinical and translational research produced by Science Careers.