Data Management & Sharing (DMS) Webinar 3: Metadata and Data Standards for NIDDK Research Data
DR. JERAN STRATFORD: Thank you, Cameron. Hello. My name is Jeran Stratford, and I’m a bioinformaticist at RTI International and the moderator of today’s webinar, and I’d like to welcome you all to the third webinar in the NIDDK Data Management and Sharing Webinar Series. So, the goal of this webinar series is to catalyze an understanding within the NIDDK scientific community of data management and sharing topics.
The first two webinars in the series that we’ve covered thus far were for topics about planning for data management and sharing, as well as an in-depth guide to selecting an appropriate repository for your scientific data. The focus of today’s webinar is metadata and data standards. Like Cameron said, please submit any questions that arise today with the Q&A function in Zoom. During the discussion section, we will answer as many of these questions about today’s topics and presentations as possible.
Some questions may also be answered within the Q&A function. A frequently asked questions document for the webinar will be posted with the recording of the webinar on the NIDDK DMS website. It will include all the answers from today’s questions, for those that were submitted and those that we didn’t have time to get to. We are very excited for today’s speakers, who will discuss the multiple facets of metadata and data standards. Today we will hear from Dr. Matt Schu, the Director of the Omics, Epidemiology, and Analytics
Program at RTI International; from Dr. Todd Valerius, an Associate Biologist at Brigham and Women’s Hospital and an Instructor of Medicine at Harvard Medical School; Dr. Sanjay Jain, a Professor of Nephrology, Pathology, and Pediatrics, and the Director of the Kidney Translational Research Center at Washington University School of Medicine in St. Louis; and Dr. Kenneth Young, an Associate Professor and the Chief Information Officer at the Health Informatics Institute at the University of South Florida.
We’d like to offer a warm welcome to all of our speakers today, and with that I will turn it over to Dr. Schu to get us started. DR. MATT SCHU: Thanks, Jeran. Just a quick audio check. Can you all hear me well? UNIDENTIFIED SPEAKER: Yes. DR. JERAN STRATFORD: Yeah, we can hear you, Matt. DR. MATT SCHU: Great.
Awesome. Well, thank you all very much. It’s my pleasure to be here. I’m very happy to talk a bit today about metadata and data standards and also present just a little bit about one of the projects that I work on as a scientist here at RTI, which is the Nutrition for Precision Health Research Coordinating Center, and I’ll talk a bit more about that later on and how the metadata and data standards are important to us in our rollout of that initiative. So, starting off, I’d like to discuss a bit of the why, the motivating questions as to why it’s important to focus on having good metadata to support your data. Really, you need metadata to validate research findings, strengthen the analysis that you have, combine that data with other data sets, allow for the reuse of data that’s hard to generate, and, through all of that, open up new discoveries when we’re able to integrate data across different studies and across different times.
Also, it’s important to have good metadata that accompanies your data to facilitate and foster trust and make sure that any finding that you present is reproducible, and folks fully understand the methods that were used to achieve that data. Next slide. Go ahead, next slide. So, that’s a bit about the why, but what is metadata? So, metadata, very simply put, is data about data, and more specifically, it’s the information that you need in order to use the data that is in front of you, discover it, understand it deeper, and ultimately either recreate findings or combine the data that’s there with other data of similar ilk. So, you know, we can think about the kind of classic questions: it’s the who, what, where, when, why, and how of how that data is collected.
That’s a bit, sort of, abstract, so let’s dive into a specific example in the next slide, and I’ll talk more about what metadata is. So here’s an example of data, right? It’s a picture of a puppy. It’s very cute. But there are limitations to what you can do with this photo.
It’s a nice photo. You can enjoy it. And potentially you can share it with a friend and say something about it.
But if you have deeper questions, this image file all by itself really doesn’t allow you to go and explore and answer any of those deeper questions you might have. And so let me tell a little bit more about what those deeper questions might be as a motivating example. Next slide.
So instead of just, like, a puppy enthusiast who enjoys looking at photos, let’s say you have a bit more of an angle that you want to take in exploring this data. Say you’re looking for a pet, and you’ve been looking at different images and identifying that this is the kind of pet you want, so you want to identify what breed of dog that is. Again, you can’t do that just from the image as is. Next slide, or hit the advance button. If you were a photographer and you were really happy with how this particular photographer captured the image and got everything in focus, you’d want to know more about the camera settings.
Couldn’t do that with just what’s presented right here. Next slide. Or if you’re, you know, deciding that you want to make this into a poster, for yourself or for, you know, commercial use, you want to understand what the license agreement was for this image that you found online. Again, none of that’s there at present when you’re just looking at this image. Okay, next slide. But the answers to all those questions are what would be captured in a metadata package.
So again, if you looked at everything we’d consider metadata, you’d have everything from the image title to more details about the content; again, in this case, it’s a puppy that the photo was taken of. Who took that data, the machines that were used to generate that data, and then a bit more about the format of the file, which could be important, and any licensing agreements. So this toy example is distinguishing data, what you get at face value, from all the deeper things that you need in order to use that data or integrate it with other things, which is what we call the metadata. All right, next slide. All right, so it’s also helpful sometimes to think of metadata at multiple different levels. So, there’s the study or data set level, which is information that’s going to describe how the data is collected, and typically, in our cases, it’s the science around the data generation.
So, the study name, funding information, version release, the collection protocols, which are key for any kind of reuse you might want to do, as well as sample size and population descriptions. This would all be study-level metadata that we would think about when we’re talking about data standards and metadata standards. There’s also metadata at the variable level. And so, that’s information about the individual variables, typically contained in a table. So the variable name, a description of that variable, you know, the data type. Is it Boolean? Is it numeric? Is it character? The data format and any other data sets that contain the variable.
And again, more about the generation method. Typically, when we talk about variable-level metadata we store it in what’s called a data dictionary, which is like a separate table that describes all the different fields that are contained in a larger table. And then finally there’s metadata that we capture at the file level. So that’s, you know, obviously the file name, the URL, the file format, a description maybe of its size, and access information. And the important thing here is we’re going through this list and it might seem a bit mundane, but that’s probably because you’re already collecting it.
Like, this is pretty standard stuff that we collect. But what we really need to make sure of, and adhere to ourselves, too, is that we present these metadata to others when we’re sharing our data writ large. Next slide. So, you know, you can go extremely deep when you’re collecting metadata, and, you know, the puppy example was pretty good. I don’t think I’ve ever seen a puppy photo shared with all that metadata contained.
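[Editorial example] The data dictionary described earlier, a separate table that names, describes, and types each variable, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the variable names (`participant_id`, `age_years`, `current_smoker`), the `VariableEntry` class, and the `check_row` helper are all hypothetical and are not taken from any study mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableEntry:
    """One row of a data dictionary: metadata about a single table column."""
    name: str
    description: str
    data_type: str  # "boolean", "numeric", or "character"


# Hypothetical dictionary for a small participant table.
DATA_DICTIONARY = {
    entry.name: entry
    for entry in (
        VariableEntry("participant_id", "Study-assigned participant identifier", "character"),
        VariableEntry("age_years", "Age at enrollment, in years", "numeric"),
        VariableEntry("current_smoker", "Whether the participant currently smokes", "boolean"),
    )
}

# Python types accepted for each declared data type.
_ACCEPTED = {"boolean": (bool,), "numeric": (int, float), "character": (str,)}


def check_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of problems found when checking one row against the dictionary."""
    problems = []
    for column, value in row.items():
        entry = dictionary.get(column)
        if entry is None:
            problems.append(f"{column}: not described in the data dictionary")
            continue
        # bool is a subclass of int in Python, so exclude it from numeric columns.
        is_bool_as_number = entry.data_type == "numeric" and isinstance(value, bool)
        if is_bool_as_number or not isinstance(value, _ACCEPTED[entry.data_type]):
            problems.append(f"{column}: expected {entry.data_type}, got {type(value).__name__}")
    return problems
```

A row whose columns all appear in the dictionary with the declared types yields an empty list; an undocumented column or a mistyped value is reported by name, which is the kind of check a downstream user can only run if the dictionary is shared alongside the data.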
And so, you know, knowing how much metadata you have to collect for any one data type is really important. And for that, it’s important to have metadata standards. And just listed here is that, you know, while this is a bit nascent, metadata standards are beginning to form for a lot of different, commonly shared data types, including biochemical, clinical, genomics, transcriptomics, imaging, metabolomics, and proteomics data types, and on the right side of the table here, I’m sharing some of the standards that you’ll find if you’re trying to, you know, adhere to best practices for sharing the metadata. And the best way to think about these standards is as a way of understanding, “What is the minimal information I need to share along with my data so that data can be reused by others downstream?” Next slide. Okay, so with that general introduction, I’ll now pivot and talk a bit more about the scientific initiative that I’m representing here, which is the Nutrition for Precision Health Initiative, which is powered by All of Us.
NPH is a landmark initiative that’s funded by NIH to deeply study the relationships between nutrition, human biology, and health. The study is aimed at enrolling 10,000 participants across 14 different U.S. sites, and the goal is to gather exquisite detail regarding the volunteers’ diet, their environment, their microbiome and metabolomic signatures, and integrate all of that data to advance our scientific understanding of how each of our bodies responds uniquely to food.
Next slide. DR. JERAN STRATFORD: Matt, you got about 3 minutes left. DR. MATT SCHU: Okay. Again, one of the reasons that we are so keen on adhering to good data and metadata standards is because all the data that we collect in NPH are ultimately going to be shared on the All of Us Researcher Workbench.
And that’s the goal, you know, to make this data very public and integrable with the other data that’s already being collected by All of Us. I should have mentioned previously that all the participants in NPH are going to be currently enrolled in All of Us, and so this is a subset of the All of Us participants, which means that all the data that we collect in our NPH initiative will have the ability to be combined with the participants’ other data collected from other facets of All of Us. Next slide. So, how do we use metadata? Well, I mentioned one of the data types we have is microbiome data.
So here’s a very specific example of the metadata we collect around the microbiome. So, the microbiome is the collection of bacteria, grossly speaking, living in your gut or on your skin. We’re focused primarily on the gut.
And we do sequencing to detect the flora that are in that environment, and with those sequencing files we collect, you know, the sample name, the host subject name, sample type, the sample quality, physical specimen, collection date, and country, and all that’s really, really important. And again, this is just to show that every data type has kind of unique metadata you might collect. Sample quality in our case is captured by a piece of metadata, which is the Bristol Stool Scale. I’ve presented the text of the Bristol Stool Scale on the right there and politely not shown images of how we gauge what the sample collection looks like, but the idea here is just to give a little bit of information about what the sample collected from the subject looks like when it’s taken in, and that’s for triaging any kind of issues that might happen downstream when we’re doing data analysis.
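[Editorial example] The sample-level fields listed above lend themselves to a simple completeness check at intake. The sketch below is an assumption-laden illustration, not the NPH schema: the field names, the `validate_sample` helper, and the choice of ISO 8601 dates are hypothetical. The only external fact relied on is that the Bristol Stool Scale has seven types, numbered 1 through 7.

```python
from datetime import date

# Hypothetical field names modeled on the fields mentioned in the talk.
REQUIRED_FIELDS = (
    "sample_name",
    "host_subject_id",
    "sample_type",
    "collection_date",
    "country",
    "bristol_stool_scale",
)


def validate_sample(record):
    """Return a list of human-readable problems with one sample metadata record."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]

    if "bristol_stool_scale" in record:
        score = record["bristol_stool_scale"]
        # The scale is an integer type from 1 to 7; reject booleans explicitly.
        valid = isinstance(score, int) and not isinstance(score, bool) and 1 <= score <= 7
        if not valid:
            errors.append("bristol_stool_scale must be an integer from 1 to 7")

    if "collection_date" in record:
        try:
            date.fromisoformat(record["collection_date"])
        except (TypeError, ValueError):
            errors.append("collection_date must be an ISO 8601 date such as 2023-01-31")

    return errors
```

Running a check like this when the sample is logged, rather than at analysis time, is one way to support the downstream triage the speaker describes.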
Next slide. Again, this tees up the need for having good standards writ large for scientific reuse and data integration. And so, pivoting from metadata to data standards, I’ll talk a bit about why it’s so critical to have agreement across data standards so that we can do data integration and data harmonization. Next slide. So this is another kind of general case, but it pops up a lot when we talk about environmental factors, like smoking. And it presents a good clear way of explaining why it’s really, really important to have clarity on what the data you’re collecting is.
So, here’s an example of two studies where smoking was one of the environmental variables that was studied. In Study 1, participants were asked a question like, “How many packs a day do you currently smoke?” and the answers were coded 1, 2, 3, 4, with 1 being zero packs; 2 being one to two packs; 3 being three to five packs; and 4 being more than five packs a day. Study 2 had another question, also multiple choice: “What’s your current smoking status?” Never smoker, past smoker, current smoker. And again, that was coded into values 1, 2, and 3. Now, if you were a researcher downstream looking at these two studies and just had a spreadsheet with these numbers here, you’d have no good way of understanding how you could, or if you could, combine the responses coded in Study 1 with the responses coded in Study 2.
And again, this goes back to the critical nature of having things like data dictionaries that not only define what the variable is and its full name and description, but also a bit more about how it’s collected and what the mapping is. And so when you have consistent data standards and everyone agrees to those consistent data standards, you’re going to have better interpretability with your study. You know, you’ll have less lost or missing data when we’re trying to combine data that doesn’t quite agree.
And again, a nice consistent format, which allows for better data integration downstream. Next slide. So, on this slide here, and I won’t go into the detail of many of them in the interest of time, I’ll just note that several existing repositories and resources here describe really great data standards that are data-type specific. And we always encourage studies to go to these standards first when they’re thinking about their data collection methods because, ultimately, it’s going to allow for better opportunities for downstream data integration when everyone agrees to the same data standards at the outset of data collection.
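[Editorial example] The two-study smoking example can be made concrete. The code books below reproduce the codings the speaker describes; the harmonized target variable (a yes/no "current smoker" flag) and the function name `is_current_smoker` are assumptions chosen for illustration, since the talk does not say how the two studies should be combined. Other harmonizations are possible, and this one deliberately discards detail (pack counts in Study 1, never versus past in Study 2).

```python
# Code books as described in the talk.
STUDY1_PACKS_PER_DAY = {
    1: "zero packs",
    2: "one to two packs",
    3: "three to five packs",
    4: "more than five packs",
}
STUDY2_SMOKING_STATUS = {
    1: "never smoker",
    2: "past smoker",
    3: "current smoker",
}


def is_current_smoker(study, code):
    """Map a study-specific code onto a shared yes/no 'current smoker' variable."""
    if study == "study1":
        if code not in STUDY1_PACKS_PER_DAY:
            raise ValueError(f"unknown Study 1 code: {code!r}")
        # Anything other than "zero packs" means the participant currently smokes.
        return code != 1
    if study == "study2":
        if code not in STUDY2_SMOKING_STATUS:
            raise ValueError(f"unknown Study 2 code: {code!r}")
        return code == 3
    raise ValueError(f"unknown study: {study!r}")
```

Note that the raw code 2 means "currently smokes one to two packs" in Study 1 but "past smoker" in Study 2, so the same number maps to opposite answers. Without the code books, a spreadsheet of 1s, 2s, and 3s cannot be combined safely.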
Next slide. Again, here’s a really, really specific example of why it’s important, again, going back to the microbiome data and NPH. So, here is a snapshot of some of the data standards wiki we have for our microbiome data, and what you see here is, just for the sample name, we describe exactly what that’s going to look like, the format of that text, and a description of what can be included and what can’t be included when you’re capturing that. And this is really, really important. There are some things here that I’ll note: while the format is, you know, free text, you can’t just stop there in describing what the data could be.
You really need to have the additional detail in the description to show, like, what characters are allowed and what characters aren’t allowed because in the case of our microbiome data, which is going to go through the Cheetah pipeline, our software will, you know, effectively choke if it gets the wrong characters uploaded to it, and it could potentially disrupt the pipeline and need reformatting on the back end if folks don’t adhere to the sample name conventions that we outlined here in the data standards. So, one more nuance I’ll talk about with the importance of metadata is that it allows for processing and automation when we try to, you know, leverage pipelines to do a lot of the high-throughput data analysis that we want to do for these studies. Next slide. Okay, so with that I think that’s the end of my segment. Hopefully I’m at time. Happy to take questions now or if the organizers prefer we can take questions at the end.
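[Editorial example] A sample-name convention of the kind described above is easy to enforce before files reach a pipeline. The actual NPH naming rules are not given in the talk, so the rule below (start with a letter, then only letters, digits, and periods, at most 50 characters) is purely a hypothetical stand-in, as is the `is_valid_sample_name` helper.

```python
import re

# Hypothetical convention: a leading letter followed by up to 49 letters,
# digits, or periods. Spaces, underscores, and other symbols are rejected.
SAMPLE_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.]{0,49}")


def is_valid_sample_name(name):
    """Return True if the whole name matches the assumed convention."""
    return isinstance(name, str) and SAMPLE_NAME_PATTERN.fullmatch(name) is not None
```

Using `fullmatch` rather than `search` matters here: the entire name has to conform, not just a fragment of it. Rejecting a bad name at submission time is cheaper than reformatting on the back end after a pipeline run fails.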
DR. JERAN STRATFORD: Yeah, we’re going to actually wait and take the questions at the end, but we encourage all the participants, if you do have questions, go ahead and throw them into the Q&A function. We can then address them either through the Q&A function or during the discussion section. So we really appreciate Dr. Schu for your insight there, and we’re going to pivot over now to Dr. Valerius, who will be discussing the metadata and data standards in some of his projects.
Go ahead. Thanks, Todd. DR. TODD VALERIUS: Yeah, thank you very much for allowing me the opportunity to speak to everyone here, and I realize that, you know, my institution’s gone through the data management exercise for the last few years, and it’s been remarkable how complex it can be for many bench researchers who haven’t had to deal with the level of detail that Dr. Schu did a very nice job of presenting to this audience, covering quite a bit. Because my hospital system recently underwent a rebranding, I’m required to give this terrible green cover slide.
So let’s move on to the next slide very quickly, so we can get to something more pleasant. Yeah, so I’m part of several consortiums, and our work has been involved in combining data from very different areas into one ATLAS Center system. I’ve been part of GUDMAP since its inception, and I’ve been part of the ReBuilding a Kidney effort for the last 6 years.
Also, Sanjay Jain is going to speak after me. He and I together are the co-PIs and the scientific directors of a combined consortium overseer, the ATLAS-D2K Center. We’ve been doing that for coming up on a year now, and this will basically be the data repository and data discovery hub of both these consortiums, GUDMAP and RBK, and will actually work on bringing this data into a form that’s able to be shared across consortiums. Next slide, please.
So our overall aims are to bring this complex data into an accessible form for the community, to establish connections between the kidney and lower urinary tract data that we have with other consortiums that are related: KPMP, which Sanjay and I are both in, HuBMAP, and the Human Cell Atlas. And also, we think very carefully about how we can enable researchers of various levels of experience by providing tools with which they can interact with data. Not all of us are computational biologists or bioinformaticists, and it can be a struggle when presented with some levels of data to actually start to work with that data to address simple questions. Next slide, please.
I’m going to have a lot of words on my slides, but they’re not meant to be read at this moment, or even for me to recite; it’s so that when these slides are posted later, you’ll have a lot of information. So briefly, these two consortiums, GUDMAP and RBK, have slightly different roles, but what I want to point out on the slide is we have a lot of investigators involved in each of these consortiums. There’s a lot of new technology that comes in. When we were funded the last round for this, nobody was doing single-cell RNA-Seq, and then all of a sudden we had to deal with a lot of this. And there’s a lot of new imaging technology. And I want to point out that several species are involved here.
Mouse, human, rat, and dog in GUDMAP, and then human and human iPSCs, because we have human organoid systems in RBK, and also zebrafish. So we think a lot about how to navigate not only between adult human, but also mouse and developmental stages of these other species. Next slide, please.
So what we want to accomplish. Again, serve the users that we want. We want a broad range of people interacting with the data and making use of the data. If somebody comes into your lab, you want to get them up to speed. You want to involve them. As Dr. Schu was saying, we want data maximally useful to people
broader than our own study because that’s how we build out and we accelerate science in general. So, we think in terms of example queries, you know, some detail, whether you’re a rookie or a veteran scientist, on how to look at single-cell RNA-Seq as a common example, or to take a GWAS gene list and then map it to expression patterns that are present in our consortiums. Graphical tools, so whether these are schematics that represent data or navigate you to data that exists. Integrating molecular imaging data: if I have quantitative RNA-Seq and I want to get to where it’s expressed, for example. I mean, spatial transcriptomics is still coming up, but we have a lot of historical expression patterns that are useful. Reference data sets.
This is a big thing. We’ve done a lot of this in mouse, and we’re doing more in human, and then data harmonization across consortiums. Next slide, please. So ontology has become very important, as has controlled vocabulary. So “data dictionary” was a term that Dr. Schu used, and that’s really common in the
clinical space. We think a lot about anatomical ontologies, and it came out very early in the GUDMAP study; in 2007 was our first paper on anatomical structures and naming them, so that we’re all calling things the same thing. And that not only allows us to make sure we understand each other, which we do well with, but enables computation and navigation. Ontologies are structured relationships of terms. And then from that, when you’re doing data entry, you can limit what vocabularies can be used for that data so that they can map back to an ontology and then be navigated. Next slide please.
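[Editorial example] The two ideas above, an ontology as structured relationships between terms and data entry restricted to a controlled vocabulary, can be sketched together. The tiny part-of hierarchy below is a toy built from textbook kidney anatomy; it is not the GUDMAP ontology, Uberon, or any published term list, and the helper names (`ancestors`, `annotate`, `find_within`) are hypothetical.

```python
# Toy "part of" hierarchy: each term maps to the structure it is part of.
PART_OF = {
    "kidney": None,
    "nephron": "kidney",
    "renal corpuscle": "nephron",
    "glomerulus": "renal corpuscle",
    "renal tubule": "nephron",
}


def ancestors(term):
    """Return every structure that contains the given term, nearest first."""
    chain = []
    parent = PART_OF[term]
    while parent is not None:
        chain.append(parent)
        parent = PART_OF[parent]
    return chain


def annotate(records, record_id, term):
    """Attach a term to a record, accepting only controlled-vocabulary terms."""
    if term not in PART_OF:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    records.append((record_id, term))


def find_within(records, structure):
    """Return IDs of records annotated with the structure or any of its parts."""
    return [
        record_id
        for record_id, term in records
        if term == structure or structure in ancestors(term)
    ]
```

Because every accepted annotation maps back to a node in the hierarchy, a query for "nephron" can also return records labeled "glomerulus" or "renal tubule". A misspelled free-text label would be rejected at entry instead of silently becoming unfindable.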
So as an example, we can start to look at expression patterns that were scored and use Boolean searches in our data to then pull out very specific structures. That’s an example of well-maintained, well-curated data using a controlled vocabulary. Next slide, please. And this allows us then, if we’re doing annotation of images, to tie these ontological terms that have been used in RNA-Seq, bulk RNA-Seq, or whole-mount data
to structures we’re labeling in histological images so the computer can connect between these data sets. Next slide, please. So my recommendations here are to select a source of anatomical and cell type terms that fit your research and stick to them.
We have a lot of nice work that Sanjay’s led, really, in HuBMAP with the ASCT+B tables. This is data driven. Now, that’s limited to healthy and semi-healthy adult human tissues, of the kidney, for example, though they have many other tissues involved in the HuBMAP project, but we use that as a seed to then work through the other species and update ontologies such as Uberon and Cell Ontology. Next slide, please. So data formats, you know, we talk about raw sequence, which we just heard a little bit about, and image data.
Raw sequence data: there are privacy issues, but the computational analysis and alignment of those things, all that is collected very well by the systems in place. It’s the biosample metadata that Dr. Schu was referencing that needs to be captured at the time of the experiment. Image data is tougher.
There are so many different image formats that one thing that can happen is that they drift and they’re locked in, you know, whether you’re ZEISS or you’re Olympus or whatever. But standardizing your data on a common format means that people who use open-source software, like QuPath, can benefit, and again the biosample metadata that accompanies the technical metadata needs to be captured.
Next slide, please. These are just some examples of how, because we’ve annotated and collected that biosample information, we can connect gene expression to images, no matter what type of image it is.
Next slide. DR. JERAN STRATFORD: Dr. Valerius, you’ve got about 3 minutes left. DR. TODD VALERIUS: Thank you.
Information about complex image types. So on the left, we have 3D type imaging, and then on the right we have more traditional imaging. Next slide. And if you go ahead and play the videos, this is one of the challenges that we have, where people come upon a technology that’s very powerful, and they start doing nano-CTs, and then we have to deal with that data.
So, being able to adapt as a consortium is very important. Next slide, please. So again, the recommendations: develop a plan to capture biosample metadata and protocols, and decide on an image format. I’ll point out that libraries at most institutions can be very helpful in this area. Next slide.
So I’m going to talk about sequencing data a little bit, and why don’t we go ahead and skip ahead to slide 17. Another one. Yeah. So, back one.
So, by putting in different layers of sequencing data, we can avoid the privacy issue a little bit by focusing on processed data. Raw sequencing data can reveal information about participants, and that needs to be protected. That’s great, but it also creates a little bit of friction in using the data. So, we think a lot about standardized approaches to be able to generate counts data or R objects, Seurat objects, if you will, that can be used directly by researchers. But again, a key to that, as Dr. Schu was pointing out with data dictionaries, you need
to very well understand how that was generated so that you can incorporate it into your work. Next slide. Let’s go, next slide. So, briefly, what happens when references change if you’re going to work with downstream data? Realignment may be necessary.
So, we went through this in GUDMAP and found that because we had good metadata in a bunch of bulk RNA-Seq, we could take 10 years’ worth of data and redo an analysis to a common reference genome, bring all that data up to date, and then be able to use that data now in common visualizations. And that was very powerful because we had the background of all that metadata. Next slide.
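[Editorial example] Recorded metadata is what makes a bulk realignment like the one described above plannable. The sketch below assumes each data set's record notes which reference genome it was aligned to and whether raw reads are still available; the record layout, the data set IDs, and the `plan_realignment` helper are hypothetical, and the assembly names are used only as illustrative labels. It shows triage only, not the alignment itself.

```python
# Hypothetical metadata records for four bulk RNA-Seq data sets.
DATASETS = [
    {"id": "rnaseq-001", "reference_genome": "GRCm38", "raw_reads_available": True},
    {"id": "rnaseq-002", "reference_genome": None, "raw_reads_available": True},
    {"id": "rnaseq-003", "reference_genome": "GRCm39", "raw_reads_available": True},
    {"id": "rnaseq-004", "reference_genome": "GRCm38", "raw_reads_available": False},
]


def plan_realignment(datasets, target="GRCm39"):
    """Sort data set IDs into up-to-date, realignable, and blocked groups."""
    plan = {"up_to_date": [], "realign": [], "blocked": []}
    for record in datasets:
        if record.get("reference_genome") == target:
            plan["up_to_date"].append(record["id"])
        elif record.get("raw_reads_available", False):
            # Raw reads exist, so the data can be realigned to the target.
            plan["realign"].append(record["id"])
        else:
            # Without raw reads there is nothing to realign from.
            plan["blocked"].append(record["id"])
    return plan
```

A data set with no recorded reference genome is treated as needing realignment, since there is no way to confirm it already matches the target.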
Some of our findings, and you can read this later. We had a lot of errors, and a lot of successes, very few successes, and we went through and fixed this data because we were working with the data. Next slide. And again, just to point out in the upper right area here, you see that we were able to do these visualizations of bulk RNA-Seq that weren’t possible before, until we brought all that data in line, we had fresh QC data, and we had it all on the same reference genome. So, that made that data, even though it was 12 years old, useful to people. Next slide.
I’m almost done. So again, I’ve put the recommendations at the bottom about capturing your code, capturing pipelines, sharing pipelines. HuBMAP’s doing a very good job of this, and we have some of those people on our team to be able to publish pipelines that people can use so that people understand how processed data was generated from raw data, and hopefully won’t need the raw data and won’t have to go through data use agreements, etc. They can get to work immediately.
And then finally I want to just, last slide. Next slide. These are the people that are involved in the ATLAS-D2K team. Sanjay and I get to do some of the fun stuff with biocuration and ontology. And then various members, the Kesselman, Blood, Kretzler, and Börner teams from across different consortiums, including KPMP and HuBMAP, allow us to bring all this together, and we work quite hard on exchanging data between consortiums so that we have harmonization.
And I’ll leave it there, and thank you very much for your attention. DR. JERAN STRATFORD: Thank you, Dr. Valerius. So now we’re going to switch over and hear from Dr. Jain.
DR. SANJAY JAIN: Hi, can you hear me? DR. JERAN STRATFORD: Yes, yes, thank you. DR. SANJAY JAIN: Okay, thank you. Do you have the slides? DR. JERAN STRATFORD: Yeah, we’re working on those right now. We want to remind everyone that, yes, please use the Q&A function for any sort of questions that arise.
We’ll try and answer those during the discussion section that will be coming up. DR. SANJAY JAIN: Apologies, give me just a second. I am frozen. DR. JERAN STRATFORD: So, while we’re waiting,
Dr. Valerius…oh, looks like we got slides. DR. SANJAY JAIN: Okay, great. Thank you so much. Yeah, good afternoon, everyone, and thank you to the organizers for the opportunity. And thank you to my predecessor and especially partner in crime, Todd Valerius, for setting the stage for me to talk about how we’re using quality control and harmonization in data standards to understand the human kidney in the different projects.
So, what I will do is basically give you an example of our work on making a kidney atlas from both healthy and diseased tissues, and next slide, please. So the process that we are using to make a human kidney atlas is really based on three consortia, where we are using spatially and single-cell resolved technologies to understand the human kidney both in health and disease. So the HuBMAP consortium is mainly defining the healthy kidney, the Kidney Precision Medicine Project the diseased kidney specimens, and we are doing this across the lifespan with a new project with the Pediatric Center of Excellence in Nephrology. And just to define what, at least at a very broad level, I’m thinking about atlas projects: an atlas is really a collection of spatially resolved morphological and molecular maps, which help us understand the vital functions of the kidney and generate knowledge that will provide insights to prevent kidney injury and promote recovery.
So with these goals, I’m going to go to the next slide to lay out, you know, what I’ll be talking about. So, there are three main bullet points I want to mention about the process that we follow. One is quality assurance and control, giving the example of the follow-the-tissue pipeline. The second one is assay/data harmonization, and particularly how that helps us to make bridges across different types of data and get rigorous experimental results. And the third one is basically showing an example of how, by establishing this process, we are able to derive critical insights from cross-species integration for cellular diversity and time courses.
So the bottom left is basically showing you that, at least in the kidney space, there are at least seven different initiatives that are trying to understand the kidney from different angles. And on the right side is just an example from KPMP, with different major technologies trying to understand biomolecules in both single-cell as well as spatial contexts in 2D and 3D. So we have many assays, many institutions, and many user personas that we have to keep in mind when we are generating data so that the community can make maximal use of our findings. Next slide. So let me just go first to the metadata and nomenclature and really give a few points on quality control. Next slide.
So taking an example of single cell technologies. So, this is a tall order, which the process starts from tissue acquisition and then many different steps on the left, which are highlighted away from isolation all the way to the end to sharing the results with the community. So for each of these steps, there are many choices that are to be made. For example, when you’re collecting a tissue, especially in a tissue, you have to think about the procedure that is being used. And with that procedure you have to think about, you know, how that impacts the data that you’re going to generate.
And therefore, for each of these choices, there are challenges, some of which are highlighted on the right side. So all these choices have an impact on the data type that you will generate and share, and, you know, if there are problems in any of these steps and they are not quality controlled, you can end up basically with data that’s not usable, which basically means trash in, trash out. So we want to basically keep the FAIR principles in mind and apply quality control at each of these steps.
Next slide. So here’s an example of what we are thinking about in HuBMAP and the Kidney Precision Medicine Project. So the idea of quality control is to really generate data that’s reproducible and rigorous. So we want to minimize technical variation and understand it, so that we can really uncover the biological variation that is contributing to the data, and we want to design experiments so that the data can be reproducible, the experiments are designed rigorously, and we are transparent about our protocols, as you heard from Todd Valerius, so that the community can access them and then implement them in their own studies.
So in the follow-the-tissue pipeline, we’re basically starting from the participant, all the way to where the data is organized, in many cases the central repository or data hub that you heard about, and the principles that guide those things, and the biomolecules can range from RNA, metabolites, and proteins to even DNA. So at the subject level, when you do human studies, it’s very important to capture the clinical metadata and associated data that are related to the subject, because these have an impact on interpreting end results. We want to minimize confounds, so that we don’t make conclusions that are not justified by our observations. So I’m listing some of these categories of clinical metadata that are associated with the subject.
The second level is the specimen, and this is key, you know, because all of the high-throughput data, the -omics technologies, are coming from specimens. And these are pre-analytical variables that go from procurement, processing, and preservation to determining the integrity of the tissue, such as pathology, composition, and so forth, and also storage conditions and shipping, because some of the analytes could be sensitive to any of these and we can’t really predict, you know, ahead of time what will have an impact. So the more you record and document, the better it is, and new technologies are coming down the line, so we don’t know how these parameters will impact those.
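As an illustration of recording these pre-analytical variables in a structured way rather than in free-form notes, here is a minimal sketch. The field names and example values are hypothetical and are not the HuBMAP or KPMP schema; the point is only that an empty field becomes visible at the time of capture.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class SpecimenRecord:
    """Hypothetical pre-analytical metadata for one tissue specimen."""
    specimen_id: str
    participant_id: str
    procurement_method: str   # e.g. "needle biopsy" or "nephrectomy"
    preservation: str         # e.g. "OCT frozen" or "FFPE"
    region: str               # anatomical region the sample came from
    laterality: str           # "left" or "right"
    storage_temp_c: float
    shipping_condition: str

    def missing_fields(self) -> list[str]:
        """Return the names of text fields that were left empty."""
        return [
            name for name, value in asdict(self).items()
            if isinstance(value, str) and not value.strip()
        ]


# Made-up example values; the shipping condition was not recorded.
record = SpecimenRecord(
    specimen_id="S-0001",
    participant_id="P-0001",
    procurement_method="needle biopsy",
    preservation="OCT frozen",
    region="cortex",
    laterality="left",
    storage_temp_c=-80.0,
    shipping_condition="",
)

print(record.missing_fields())
print(json.dumps(asdict(record), indent=2))
```

Serializing the record to JSON is one possible way to keep the metadata alongside the data files; any structured format would serve the same purpose.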
So it’s important to capture these things. And it’s the same thing when you’re doing your assays, whether it’s -omics or imaging assays: you want to also capture the analytical components, whether it’s instrumentation, the markers that we’re using, the software, or the quality control cut-offs that you’re using, and to list those so that the users can interpret the data that you have generated. And then even in the repository, you have many different metadata parameters, such as harmonization, data visualization tools, software versions, the file formats that Todd talked about, and then the repositories that are hosting the data, and what the sustainability is. So, once this process is in place, next slide please, I want to give you an example of…just a sampling of why that’s important.
So many of us are using samples, especially for human studies, from different sources, so it is critical to be able to provide accurate registration and mapping of where the sample comes from. And this is important because when you’re dealing with clinical samples, they’re very small, and we are usually generating data from one point in time from very small samples. So it is important to document this. For example, in this case in the kidney, we have regions A, B, and C marked in the whole kidney sections over here, and that tells us where each section has come from. In the middle panel on the lower side, here’s the medulla, so we’re basically registering that on a kidney diagram, and then we have all the molecular data on the right side that comes from it, so we can eventually link all of this -omics data and the markers to the exact location at the organ level, and we can go across different scales from the human body all the way to the molecular level. In the same way with the clinical data, if you have the sample registered, then you can know the variability that may be due to different regions that are contributing to your data, so you can interpret that properly. So this is an example of why tissue mapping and registration for the location where you get your sample is important, including the sex, the laterality (left or right), and the region.
Next slide. DR. JERAN STRATFORD: Dr. Jain, you got about three minutes left. DR. SANJAY JAIN: Okay, I want to talk about assay integration.
Next slide. So…so I talked about…tissue quality in data. So in this table, I’m just highlighting a few points of how to harmonize when you have multiple assays that are listed over here.
So, in sequencing and imaging, people use different cutoffs. We have different genome reference versions, so it’s important to document those. In the bottom left, we have certain controls that we [inaudible], so, assay drift controls, software used and data type used, and the file formats that Todd touched on.
So I’ll talk a little bit about assay drift controls. So when you are doing -omic assays over time, you want to make sure that you don’t see variations because the reagents get changed or the software or analytical pipelines change.
So in KPMP, what we have established is this assay performance control. So these are Levey-Jennings plots that basically plot over time each sample and some QC metric that you can use, for example, genes per nucleus in single-cell data, and the circled dots are basically standardized tissue that we use every batch or every other batch to ensure over time that the assay is performing well. So this actually really helps us to gauge any deviations outside the two standard deviation line over here and then go back to look at the metadata to see if something that was contributing or changed is affecting our assay. So, that’s one example.
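The control-chart check described here can be sketched in a few lines. This is a minimal illustration, not the KPMP pipeline: it assumes the center line and standard deviation are estimated from an initial set of accepted control runs, and that later runs of the same standardized tissue are flagged when they fall outside the two standard deviation limits. The function name and all numbers are made up.

```python
from statistics import mean, stdev


def levey_jennings_flags(baseline, runs, n_sd=2.0):
    """Flag control runs whose QC metric falls outside mean +/- n_sd * SD.

    baseline: QC metric values from an initial set of accepted control runs.
    runs: later (batch_label, value) pairs for the same control tissue.
    Returns (batch_label, value, out_of_range) tuples in the original order.
    """
    center = mean(baseline)
    sd = stdev(baseline)  # sample standard deviation of the baseline runs
    lower, upper = center - n_sd * sd, center + n_sd * sd
    return [
        (label, value, not (lower <= value <= upper))
        for label, value in runs
    ]


# Made-up values for illustration, e.g. median genes per nucleus per batch.
baseline = [2000, 2100, 1900, 2050, 1950]
runs = [("batch-06", 2020), ("batch-07", 1600), ("batch-08", 1980)]

for label, value, out_of_range in levey_jennings_flags(baseline, runs):
    print(label, value, "review metadata" if out_of_range else "ok")
```

A flagged batch is a prompt to go back to the recorded metadata for that batch, not an automatic rejection.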
Next slide. So, here’s an example of nomenclature harmonization. This is very key because when we have data sets generated from different labs and people have different anatomical contexts and cell types, so we want to make sure people mean the same thing when they’re talking about it. So in this table, what I’ve highlighted is anatomical structures.
We are standardizing by using ontologies from Uberon at different anatomical levels so people know what structures are, you know, being mentioned. At the bottom are cell type names, and for these we use the Cell Ontology, so every cell type that is discovered or known, we name and reference by the Cell Ontology so the community knows which cell type you’re talking about and what it means. And it’s the same thing for biomarkers; for example, for genes we use HUGO gene nomenclature so that every gene has a unique identity, so that we will know what we are talking about. Next slide.
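To make the idea of nomenclature harmonization concrete, here is a minimal sketch in which the free-text labels used by different labs are resolved to a single ontology term ID. The lookup table is a hypothetical, hand-built subset; the Uberon IDs are shown for illustration and should be verified against the current ontology release before use.

```python
# Hypothetical lab-maintained lookup: free-text label -> (term ID, preferred name).
# IDs are illustrative; check them against the current Uberon release.
SYNONYM_TO_TERM = {
    "kidney": ("UBERON:0002113", "kidney"),
    "renal medulla": ("UBERON:0000362", "renal medulla"),
    "kidney medulla": ("UBERON:0000362", "renal medulla"),
    "medulla": ("UBERON:0000362", "renal medulla"),
    "renal cortex": ("UBERON:0001225", "cortex of kidney"),
    "kidney cortex": ("UBERON:0001225", "cortex of kidney"),
}


def harmonize(label):
    """Return (term ID, preferred name) for a free-text label, or None if unmapped."""
    key = " ".join(label.lower().split())  # normalize case and whitespace
    return SYNONYM_TO_TERM.get(key)


for raw in ["Medulla", "  Renal   Medulla ", "kidney cortex", "papilla tip"]:
    print(repr(raw), "->", harmonize(raw))
```

Labels that return `None` are left for manual curation rather than guessed, so that every value stored with the data carries a term ID.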
So I want to give you an example of atlas efforts in the human kidney now. Next slide. So, when we go into atlas efforts, one of the things is to define what we mean. So this is just an example to show that when we’re talking about mapping cells in the human body, we define what a healthy state is, and we define what the non-healthy references are.
So as long as you have good definitions that are backed by literature and data…so community would know what you mean when you’re talking about certain subtypes. Next slide. So in making the kidney atlas, so now we have established a process of infrastructure of nomenclature cell types, so we in…so, what I’m going to describe briefly is, you know, human kidney from different sources. They have five different technologies. Each one creates maps. And then because we have these unified names that bridge across different technologies, we can integrate to make an atlas.
Next slide. So the Human Kidney Atlas was a joint effort of many different consortia. Here I’m showing you basically an example on the lower left from many different technologies and from LD or AKI/CKD patients. There are multiple technologies listed over here. Imaging technologies are in the bottom, and in the top right are many single-cell technologies, from single-cell sequencing to chromatin states. Because we have now established the nomenclature and also the data formats, we can merge these data sets and identify cellular diversity in the human kidney.
Next slide. So, once we have done this, then we can also go across species. In the left example over here is mouse data from a controlled experiment over different time points for ischemia-reperfusion injury. So now we have cell types defined, so we can take this data and project it onto the human to understand where the mouse actually correlates with the human, and then actually resolve the mouse cell types into the human. By doing this, we can actually better understand where the correlation between mouse and human is and what the biology of this injury is. Next slide.
So for this example, I basically have separate panels over the time course of the mouse, and what is shown in these different panels is the injury states. For example, at 4 hours, there’s an altered injury state in TAL2 that’s coming up in the mouse, which also correlates in the human. And in the bottom, at 6 weeks, we can see which injured cells have persisted and which have resolved, which can help us better understand the time course when we get a single specimen from human biopsies. So, this example was possible because we were able to harmonize data formats, as well as nomenclature. Next slide.
So I just want to show you the resources that are available where these data can be accessed. It’s important for it to come to the community. On the left is a reference from HuBMAP, where people can put in their own data sets from single cell, and the program will basically annotate your cell types. In the middle one, you can put in your genes of interest, run by CELLxGENE, and it will tell you, along with the associated metadata, where the gene is expressed, and then you can do some limited differential expression analysis.
The Atlas Explorer is from KPMP, and this allows you to cross multiple technologies, look at different kidney disease specimens, and understand the expression and spatial distribution. And on the right side is a dynamic ASCT+B table for the kidney, which tells you about nomenclatures and markers across different ontology databases, so people can use that as a reference. Next slide. So, this is obviously a very big team science collaborative effort, as Todd mentioned before. I’ve listed many of the contributors over here, especially the ones in these different technologies that have generated the first kidney single-cell spatial atlas version one, which will be published next month in Nature, and many different supporting agencies, and the grants that are listed over here.
Next slide. I think that’s my last slide. Thank you.
DR. JERAN STRATFORD: Thank you, Dr. Jain. That was great. Dr. Young, we’ll kick it over to you. DR. KEN YOUNG: All right, great.
Good afternoon, everyone. You can…no, actually stay on this slide. Thank you for the opportunity to present.
I’m Ken Young, the CIO and Assistant Professor at the University of South Florida’s Health Informatics Institute. I lead the IT team for the TEDDY Data Coordinating Center. And today, I’ll be discussing an overview of The Environmental Determinants of Diabetes in the Young (TEDDY) study and its data types, developing metadata and data standards in the TEDDY study, and how implementing data and metadata standards adds value for data sharing and reuse beyond the life of the TEDDY study. Next slide. So, TEDDY is designed as a prospective cohort study of over 8,000 children enrolled before 4 1/2 months of age and followed for 15 years to identify genetic and environmental triggers of type 1 diabetes.
The TEDDY cohort consists of children identified to be at increased genetic risk, who either had a parent or sibling with type 1 diabetes, a first-degree relative, or were from the general population. The TEDDY study investigates genetic and genetic-environmental interactions, including gestational infection or other gestational events, childhood infections, or other environmental factors after birth, in relation to the development of prediabetic autoimmunity and type 1 diabetes. The TEDDY study participants are followed for 15 years for the appearance of various beta cell autoantibodies and diabetes, with documentation of early childhood diet, reported and measured infections, vaccinations, and psychosocial stressors.
Other participants are no longer followed once they reach the study endpoint. Next slide please. The TEDDY study is a multinational study that recruited participants in Europe and the United States. So, we have data coming in across the globe.
Next slide. The TEDDY Clinical Centers are located in Colorado, Finland, Georgia and Florida, Germany, Sweden, Washington State, and the Data Coordinating Center is in Florida. It’s where I am located.
Next slide. So a variety of different data types are collected as part of the TEDDY study, including clinical metadata and laboratory test results, data across various -omics analytes. Here at the TEDDY DCC, we manage, curate, and integrate, and provision these data assets for analyses by TEDDY and approved external investigators.
And we have nearly over 1 petabyte of data that’s stored at the DCC. Next slide. TEDDY has a vast amount of diverse clinical metadata. And an important aspect that we have for documentation is our TEDDY website. We provide documentation for researchers that includes a clinical metadata overview and collection summary.
And listed here are some of the TEDDY clinical metadata that we collect…over time. Next slide. TEDDY also contains a large amount of -omics analytes, and again, our TEDDY website, our public website provides documentation for researchers, and also includes an -omics data overview and -omics control data summary.
And I think someone just posted the TEDDY website…I was going to get the URL. But thank you for doing that. Next slide. So, data standards are very important, as the use of data standards enables reusability of data elements and their metadata, which can reduce redundancy between systems, thereby improving reliability and often reducing costs.
And to increase interoperability, the TEDDY study implemented the following biomedical ontologies: for adverse events, we use CTCAE version 5.0. We also used ICD-10 and SNOMED CT for diagnostic information, and RxNorm for medications. On top of TEDDY, we also have other studies at the Health Informatics Institute, and for the clinical research studies where data are submitted to the FDA, we use the CDISC standardized data sets: the Study Data Tabulation Model, SDTM, and the Analysis Data Model, ADaM. Next slide. To improve the quality and reusability of the data, our electronic case report forms, our eCRFs, were designed to capture certain data standards directly.
As you see on this slide, we directly capture the ICD-10 codes right on the eCRF form. And this is vital for research and analyses, as you’ll have these ICD codes instead of free text. Next slide. TEDDY’s…as I stated, it’s…we’re studying participants up to 15 years. So, it’s been a long study. And in addition to using standard ontologies, TEDDY…we also created these unique TEDDY codes to capture other clinical data in a standardized way.
Now, this limits the use of free text fields, as that can be more challenging and time consuming to clean and analyze. So, this approach provides a standard way to collect data that may typically be freeform and difficult to analyze. And over time, researchers can request additional codes to be added. Next slide, please. A…a key component of our data sets are these data dictionaries that we provide.
So, we have TEDDY data dictionaries, and they contain metadata that provide additional information intended to make scientific data interpretable and reusable. So, to ensure the data are findable and reusable, each clinical data set TEDDY shares is accompanied by a data dictionary, which contains the variable name, type, length, and label. Providing sufficient, well-structured metadata is a key component of abiding by the FAIR principles, and our data dictionaries can be provided in multiple formats—RTF, CSV, Word, Excel—and I know CDISC and others are also starting to incorporate JSON, so that’s something that we could also look at, like JSON and XML. The metadata here provides information about one or more aspects of the data, and it summarizes the basic information. So, it can be used for tracking and working with the specific data, much easier for researchers.
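As a sketch of what a data dictionary with variable name, type, length, and label might look like when generated alongside a data set, consider the following. The variables, labels, and values are hypothetical and are not actual TEDDY variables; the sketch also assumes types can be inferred from the first row, which a real study would define explicitly.

```python
import csv
import io
import json

# Hypothetical data set and labels; not actual TEDDY variables.
rows = [
    {"participant_id": "M0001", "age_months": 9, "country": "Sweden"},
    {"participant_id": "M0002", "age_months": 12, "country": "Finland"},
]
labels = {
    "participant_id": "Masked participant identifier",
    "age_months": "Age at visit in months",
    "country": "Country of the clinical center",
}


def build_dictionary(rows, labels):
    """Derive name, type, length, and label for each variable in the data set."""
    entries = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        entries.append({
            "name": name,
            "type": type(values[0]).__name__,  # inferred from the first row
            "length": max(len(str(value)) for value in values),
            "label": labels.get(name, ""),
        })
    return entries


dictionary = build_dictionary(rows, labels)

# The same dictionary can be written out in more than one format.
as_json = json.dumps(dictionary, indent=2)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "type", "length", "label"])
writer.writeheader()
writer.writerows(dictionary)
as_csv = buffer.getvalue()

print(as_csv)
```

Generating the dictionary from the data set itself, rather than maintaining it by hand, is one way to keep the two from drifting apart between releases.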
The data dictionaries include metadata on the methodology and procedures used to generate the data: variable name, label, definition, length and type, and other relevant information that is needed to interpret or link the variable to other data. Next slide. Depicted here is a data dictionary for our -omics data, as these dictionaries may vary by type.
Examples of these dictionaries are the manifest files, which are generated by the instrument manufacturer, and also annotation files. Depicted here is the Illumina manifest for the HumanHT-12 v4 chip, used to generate the TEDDY microarray gene expression data. Next slide please. DR. JERAN STRATFORD: Dr. Young, you got about 2 minutes left. DR. KEN YOUNG: Okay. And here is the data dictionary for the metabolomics data.
Next slide, please. So Direct to Investigator Data Releases also receive release notes describing the data freeze date, population, data sets provided, and any relevant notes for investigators. The release notes metadata summarize the information about data releases, which can make tracking and working with specific data easier. Next slide, please. The TEDDY -omics metadata is also shared with data repositories, such as dbGaP and the Metabolomics Workbench. The Metabolomics Workbench is housed at the San Diego Supercomputer Center, and it serves as a national and international repository for metabolomics data and metadata.
And also, dbGaP is a database of genotypes and phenotypes. Next slide, please. Another key component to TEDDY and what we’ve done for standardizing areas of our application and our data are these annotated forms and our eCRFs. These TEDDY eCRFs are annotated with a data set name at the top and a variable name by each field. And they’ve been shared with the NIDDK central repository as a searchable PDF. The annotated forms allow investigators to find data of interest, see how it was collected, and identify related variables.
Next slide, please. As…as someone already keyed in, our TEDDY public website has valuable information for investigators, so they can find documents detailing our TEDDY data collection procedures, data availability, data sharing policies, and so on. And these are updated periodically, once we’ve made more data available.
Next slide, please. At the DCC, we develop an efficient and comprehensive platform for the acquisition, management, integration, analysis, and sharing of scientific data assets to handle these large data, especially with the -omics research. Next slide, please. The TEDDY study has adopted policies and procedures in support of its commitments to sharing data with the scientific community, while also protecting the privacy of participants. Our data releases have been submitted at different time points in the various repositories, depending on the NIH requirements and nature of the data. Each submission is treated as an independent release, possessing uniquely masked subject and sample identifiers.
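One consequence of independent masking is that the same participant carries a different identifier in each release, so records from two releases can only be joined through a mapping table. Here is a minimal sketch, assuming a hypothetical two-column crosswalk; the identifiers, variable names, and values are all made up.

```python
# Hypothetical masked records: the same two participants under two releases.
release_a = {
    "A-17": {"hla_risk": "high"},
    "A-42": {"hla_risk": "moderate"},
}
release_b = {
    "B-03": {"metabolite_x": 1.8},
    "B-88": {"metabolite_x": 0.7},
}

# Hypothetical crosswalk pairing each release A identifier with its release B identifier.
crosswalk = [("A-17", "B-88"), ("A-42", "B-03")]


def combine(release_a, release_b, crosswalk):
    """Join two independently masked releases through an identifier crosswalk."""
    combined = {}
    for id_a, id_b in crosswalk:
        # Only participants present in both releases can be combined.
        if id_a in release_a and id_b in release_b:
            combined[id_a] = {**release_a[id_a], **release_b[id_b]}
    return combined


for participant, record in combine(release_a, release_b, crosswalk).items():
    print(participant, record)
```

Without the crosswalk there is no way to tell that "A-17" and "B-88" refer to the same participant, which is the intended privacy property of masking each release separately.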
Researchers may want to combine data across releases for analysis, but are unable to do so as a result of the independently masked identifiers. The NIDDK repository can provide repository data release identifier mapping materials to satisfy this demand, once the investigators have received approval to access the data. And our TEDDY data are shared to dbGaP, Metabolomics Workbench, and the NIDDK repository. Next slide. And I just want to acknowledge the individuals, like Dr. Krischer, Dena Tewey, and Chris Shaffer at the Health Informatics Institute, who have helped with this presentation and with the TEDDY study, and also, of course, the NIH and the entire TEDDY study group for making the TEDDY study possible, and the NIDDK for this great opportunity.
Thank you. DR. JERAN STRATFORD: Thank you, Dr. Young, and thank you to all of today’s speakers for sharing your insights and your experience. We’ve got a…a great insight into what metadata is and the importance of metadata and data standards. So, now we’ll transition over to the discussion portion of the webinar, and we’ll ask all of our presenters to come back on camera. During this section, we’re going to take a deeper dive into specific metadata and data standards topics and address as many of the questions as we can from the attendees.
So, let’s start off our discussion with the question of: When is the best time during the study life cycle to consider data standards and metadata collection to maximize the value of the study and to minimize the effort required? Dr. Young, do you want to start that? DR. KEN YOUNG: I would say…right in the study design, it’s key to start with data standards. And then as you progress and transition, new standards come out and things change over time, but I would say the most optimal time is before the study even begins, during study design.
DR. JERAN STRATFORD: Dr. Jain? DR. SANJAY JAIN: Yeah, I would agree, early as…you know, as possible, like, maybe conception of the project, writing a proposal. I think you want to have that plan in mind, and then…so that you can see which…who…which are the collaborating sites or individuals that you want to reach out early on to get that expertise. DR. JERAN STRATFORD: Great.
So the next question we’d like to jump to is: You know, when we encounter a lack of established or widely accepted standards, vocabularies, ontologies, you know, it becomes really challenging. How can we address this? Dr. Valerius, do you want to comment here? DR. TODD VALERIUS: Yeah, I agree. It’s really difficult, and because there are so many, and what I wrote in the answer that I wrote was about just picking something that works for you from the resources available.
And if you use the terms that are related to that, you know, a very good example here is that Sanjay and others have done a lot of work in creating these ASCT+B tables that feed into Uberon, the Uberon ontology, and into the Cell Ontology. So, the domain expertise from KPMP and from GUDMAP is flowing into there, and that improves those ontologies. If you use those terms, and all terms have IDs, that means as that ontology changes over time, no matter what you’re using, if you’ve captured that, there will be a mapping ability there. So, you can create a subset of what you’re going to use for your work from an existing ontology, and just make that, even if it’s an Excel spreadsheet, what you’re going to use within your laboratory, and you’ll be in much better shape than most of the data going in.
You know, I want to point out, I remember that with the Human Cell Atlas, when COVID hit, they did a large-scale…they wanted to just dive in (this is Sarah Teichmann and Aviv Regev): What data can we find across the globe (single-cell RNA-seq was their focus) that we can use to sort of think through ACE2 expression, etc.? I remember a comment when they were presenting the data, that they spent about 80% of their time just trying to understand the metadata, and how the data was collected, etc. So imagine, even if it’s 70%, if you’ve made that improvement, then you’re getting that data faster and more complete and you don’t have to throw out as much. So every little bit helps. DR. JERAN STRATFORD: Yeah, thank you so much. We’d like to thank all of the presenters that we’ve had today, and we’d also like to thank our audience for attending this webinar and engaging with us in the Q&A function.
As a reminder, a recording of today’s webinar will be available on the NIDDK DMS website. We will also be posting a Q&A document that has responses to the questions we received today, including those we didn’t have time to address. We would also like to encourage everyone to join us for our next webinar in the series, which will focus on data reuse, which is the R in “FAIR.” That webinar will be held on July 13 from 12:00 to 1:00 Eastern Time. Registration is currently open on the NIDDK DMS website.
We have some really exciting topics coming up, and we encourage you to join us for those webinars, as well as fill out the feedback request form. We hope to see you in our future sessions, and we thank you all for your attendance and hope you have a great day. DR. SANJAY JAIN: Thank you.