GTN Smörgåsbord - Day 4 - Proteogenomics 2: Database Search
Hello! My name is Andrew Rajczewski and I'm a graduate student in the Griffin and Tretyakova labs at the University of Minnesota and a part of the Galaxy Proteomics group. As a part of this year's Galaxy Training Network Smörgåsbord, I'd like to present you with the second of three tutorials in a series on proteogenomics applications in Galaxy. In this tutorial, I will take you through how to search mass spectrometry data against a database to perform proteogenomic analyses. A prominent approach in modern biological research is systems biology, wherein a plurality of biomolecules in a system are measured simultaneously to ascertain the response of that system to various stimuli. This is accomplished
through multi-omics technologies, each of which measures a different class of biological molecule. The contents of a genome, for example, are determined through sequencing, a process known as genomics. Similarly, the degree of genomic methylation and the chromatin architecture are determined through epigenomics. The totality of gene transcription can be examined through mRNA sequencing, a process known as transcriptomics. And finally, the phenotype of a system can be more directly ascertained through omics technologies such as proteomics and metabolomics, which measure the proteins present in a system and the small molecules produced by these proteins, respectively.
Our lab's focus is on proteomics and on the analysis of proteomics data in Galaxy. Modern proteomics approaches utilize mass spectrometry to identify, and at times quantify, the proteins in a biological sample. In a conventional proteomics experiment, proteins are digested enzymatically into shorter constituents called peptides for ease of downstream analysis, a technique called bottom-up proteomics. This results in an exceedingly complex mixture: to analyze all the peptides at the same time would be akin to trying to judge the makeup of an entire crowd all at once. Just as that crowd could be separated using queues and turnstiles, the peptide mixture is separated using liquid chromatography, resulting in only a few peptides entering the mass spectrometer at a time. Within the mass spectrometer, the mass-to-charge ratio of the peptides is first measured before the peptides are fragmented into smaller pieces. The mass-to-charge
ratio of these pieces is then measured in a separate event, producing an MS2 spectrum. Once MS2 spectra are collected, they can be searched against a reference database using bioinformatics tools to determine the peptides, and therefore proteins, present in the sample. When a peptide fragments in the mass spectrometer, it breaks apart in a predictable way along the peptide backbone, producing series of ions that correspond to fragmentation at discrete positions on the backbone. Within the proteomics software, these ion series are treated as a sort of fingerprint and compared against the theoretical peptides in the database, looking for a match to these ion series. Once a spectrum is annotated with a peptide sequence, it is designated a peptide spectrum match, or PSM, and assigned a score depending on the quality of the match. These PSMs can
then be assembled to identify proteins. In this way, thousands of proteins can be identified, and potentially quantified, in a single sample. Ideally, identifying all the peptides in a sample would be as simple as optimizing the collection of data by the mass spectrometer. However, it is important to note that some spectra within the dataset might originate from an unannotated portion of the proteome. By using only canonical reference proteomes as the basis for databases in bottom-up proteomics, it is theoretically possible that an enormous amount of biological information is lost. This can be corrected for by making custom databases that contain the canonical reference proteome in addition to experiment-specific sequences. For example, any sequence variants that are not present in the reference proteome
can be identified using a six-frame translation of the genome, a three-frame translation of cDNA, or a protein database derived from RNA-seq experiments. This leads to the identification of single amino acid substitutions, frameshift mutations, and alternative splice isoforms, also known as proteoforms. The use of transcriptomic and/or genomic data to supplement proteomic analysis and identify novel proteoforms is termed proteogenomics, and this is the approach my colleague and I hope to communicate to you in these tutorials. Performing proteogenomic analyses, as with most multi-omics analyses, has historically been somewhat taxing, requiring considerable time and computational finesse on the part of the analyst. Fortunately, many tools used in proteogenomic analyses have been uploaded into Galaxy, where even the most novice bioinformatician can readily use them through a simple graphical user interface. What's more, the tools can be utilized together in workflows, allowing analyses to run automatically when given the starting datasets, ultimately saving the analyst time. This graphic
represents a hypothetical workflow containing all the requisite steps in a proteogenomic analysis. For this tutorial, I will focus on the section highlighted in red, where mass spectrometry data is searched against custom databases to identify putative novel peptides. The other steps highlighted here are covered in the tutorials by my colleagues.
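To make the fragment-ion "fingerprint" comparison described above more concrete, here is a minimal sketch of how theoretical b- and y-ion m/z values can be computed for a peptide. This is not the code any of these tools actually use; it assumes monoisotopic residue masses, singly charged ions, and lists masses only for the residues of the example peptide:

```python
# Sketch of the theoretical "fingerprint" a search engine derives for a
# peptide: singly charged b- and y-ion m/z values (monoisotopic masses).
# The residue mass table below covers only the amino acids in this example.
RESIDUE_MASS = {
    "P": 97.05276, "E": 129.04259, "T": 101.04768,
    "I": 113.08406, "D": 115.02694,
}
PROTON = 1.00728   # mass of a proton (charge carrier)
WATER = 18.01056   # mass of H2O retained on the C-terminal fragment

def fragment_ions(peptide):
    """Return (b_ions, y_ions) m/z lists for a singly charged peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    # b ions: cumulative mass of the N-terminal fragment plus a proton
    b, total = [], 0.0
    for m in masses[:-1]:
        total += m
        b.append(total + PROTON)
    # y ions: cumulative mass of the C-terminal fragment plus water and a proton
    y, total = [], 0.0
    for m in reversed(masses[1:]):
        total += m
        y.append(total + WATER + PROTON)
    return b, y

b_ions, y_ions = fragment_ions("PEPTIDE")
print(b_ions[0])  # b1 ion m/z of PEPTIDE
```

A real search engine generates these series for every candidate peptide in the database and scores how well they line up with the observed MS2 peaks.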
The searching of mass spectrometry data against the custom database for proteogenomic analysis has been isolated, and painstakingly simplified, into a compact, straightforward workflow, shown here. For the remainder of this session, I will go through the individual nodes in this workflow so that you may better understand what they do. After that, I will walk you through the process of importing data for this tutorial and running the workflow itself, so that you can practice on your own. To begin, let us discuss the input files needed to run the proteogenomics database search workflow, as this workflow cannot be run without all the requisite inputs. To run this workflow, you will need three input files, each in the correct format. The first input, a file or set of files, is the raw mass spectrometry data for the experiment you mean to analyze. It is important that the files be in the Mascot Generic Format
or MGF. If they are not, the files can be converted using open-source tools like msconvert, which can be found right in Galaxy. The second input file needed is a custom database for this experiment, generated from suitable RNA-seq data to reflect the alternative proteoforms not found in the conventional proteome. This database is in the FASTA format, with all the proteins
in the sample expressed in one-letter codes, each with a unique identifying header. The generation of this database is covered in the tutorial created by my colleague, James Johnson, which I encourage you to watch before this one. The final file needed is a second FASTA database containing the reference protein accessions for the system you are analyzing, which will be utilized near the end of this workflow. In addition to the conventional proteome,
this database can also contain common protein contaminants, such as keratin from human and animal sources, to avoid misattribution of these contaminants to alternative proteoforms. As with the custom FASTA database, this is generated by the first proteogenomics workflow. Now that we've established the requisite files needed to run this workflow, let's discuss the first node in the workflow, where we use SearchGUI. SearchGUI is arguably the heart of this workflow, as this is the node that searches the raw data against the custom FASTA database. SearchGUI was developed by the Martens group
to perform searches of protein mass spectrometry data against FASTA files. Over the years, many different proteomics search algorithms have been developed, each of which has its own unique advantages and disadvantages. Ideally, multiple of these algorithms would be run; however, that can require considerable time and computational power. SearchGUI allows multiple search engines to be run at the same time, maximizing
the ability to interrogate the data without the time commitment of sequential searches. In addition to being able to perform multiple searches simultaneously, users can adjust the settings in SearchGUI to accommodate differences between experiments. Specific digestion options can be chosen, selecting from several different potential enzymes as well as varying the number of allowed missed cleavages. In addition, the peptide precursor options can be adjusted to account for different mass spectrometers' resolutions and optimize the ability to identify PSMs. Finally, post-translational modifications, such as oxidation, acetylation, or phosphorylation, can be specified in SearchGUI so that chemical modifications of the amino acids are accounted for when searching your data. Once PSMs have been identified using SearchGUI, the results go to the next node, PeptideShaker.
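As an aside, the digestion settings just mentioned correspond to an in-silico digest of every database protein. Here is a minimal sketch of tryptic digestion with missed cleavages; this is not SearchGUI's actual implementation, the rule assumed is the common one (cleave after K or R, except before P), and the protein sequence is invented:

```python
# Hedged sketch of in-silico tryptic digestion with missed cleavages.
# Rule assumed: cleave after K or R unless the next residue is P.
def trypsin_digest(protein, missed_cleavages=2):
    """Return the set of tryptic peptides allowing up to `missed_cleavages`."""
    # Cleavage boundaries: start, every index after a qualifying K/R, and end.
    sites = sorted(set(
        [0, len(protein)] +
        [i + 1 for i, aa in enumerate(protein)
         if aa in "KR" and protein[i + 1:i + 2] != "P"]
    ))
    peptides = set()
    for i in range(len(sites) - 1):
        # Spanning k consecutive segments means k-1 missed cleavages.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(sites))):
            peptides.add(protein[sites[i]:sites[j]])
    return peptides

peps = trypsin_digest("MAGKLTRPDE", missed_cleavages=1)
print(sorted(peps))
```

Note that the R in this example is followed by P, so no cleavage occurs there, which is exactly the kind of rule the enzyme settings encode.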
PeptideShaker is in many ways a companion piece to SearchGUI, having also been developed by the Martens group. The SearchGUI results include all potential PSMs generated from your data, regardless of the quality of the match between the spectra and the putative peptides. To account for this, PeptideShaker filters out the PSMs that fall below a certain false discovery rate, or FDR, set by the user, leaving only the highest-quality PSMs. PeptideShaker can also output data files in the form of simple tabular lists, as well as an mzIdentML file, whose use in subsequent analyses we will discuss later.
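The FDR idea can be illustrated with a small target-decoy sketch. This is only a conceptual illustration, not PeptideShaker's actual algorithm, and the PSM scores and decoy flags below are invented:

```python
# Conceptual sketch of target-decoy FDR filtering: sort PSMs by score and
# keep the largest score cutoff at which decoys/targets stays at or below
# the chosen FDR threshold.
def filter_by_fdr(psms, fdr=0.01):
    """psms: list of (score, is_decoy). Return target PSMs passing the FDR."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    best_cut = 0
    decoys = targets = 0
    for i, (score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr:
            best_cut = i  # deepest cutoff still within the FDR threshold
    # Report only target hits above the cutoff; decoys are discarded.
    return [p for p in ranked[:best_cut] if not p[1]]

psms = [(0.99, False), (0.95, False), (0.90, False), (0.40, True), (0.30, False)]
print(len(filter_by_fdr(psms, fdr=0.05)))
```

The decoy hits estimate how many of the target hits at a given score are likely false, which is what lets the user trade sensitivity against confidence.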
Having identified PSMs in the data, the next part of the workflow involves two steps that filter out peptides belonging to conventional proteoforms, leaving novel peptides for our analysis. While we are able to identify high-confidence PSMs using SearchGUI and PeptideShaker, with proteogenomics we are interested in the novel peptides not found in the normal proteome, which are invisible to conventional bottom-up proteomics approaches. Here, the first Query Tabular step removes all peptides corresponding to proteins in the reference proteome, such as normal proteins and contaminant peptides, leaving behind only those novel peptides that are unique to this sample. The second Query Tabular step then filters out all those peptides that are either too long or too short to be seen well by the mass spectrometer. This workflow also includes nodes that are necessary for downstream analysis: the first is the node denoted mz to SQLite.
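Conceptually, the two Query Tabular steps amount to SQL filters like the following sketch. The table layout, column names, and example rows here are invented for illustration and are not the tool's actual schema:

```python
import sqlite3

# Hedged sketch of what the two Query Tabular steps do conceptually: load the
# PSM peptides and the reference accessions into SQLite, drop any peptide
# matching a reference protein, then keep only peptides of 6-30 residues.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE psms (peptide TEXT, accession TEXT)")
conn.execute("CREATE TABLE reference (accession TEXT)")
conn.executemany("INSERT INTO psms VALUES (?, ?)", [
    ("LVNEVTEFAK", "sp|P02768|ALBU_HUMAN"),   # matches a reference protein
    ("MKWVT", "asmbl_001"),                   # novel, but too short
    ("SSAYSRGVFRR", "asmbl_002"),             # novel, acceptable length
])
conn.executemany("INSERT INTO reference VALUES (?)",
                 [("sp|P02768|ALBU_HUMAN",)])

novel = conn.execute("""
    SELECT peptide FROM psms
    WHERE accession NOT IN (SELECT accession FROM reference)
      AND length(peptide) BETWEEN 6 AND 30
""").fetchall()
print(novel)  # only the novel, length-filtered peptide remains
```

The actual workflow runs these as two separate nodes, but the combined query shows the logical effect: reference and contaminant matches drop out first, then out-of-range peptide lengths.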
One of the outputs of PeptideShaker is an mzIdentML file, which stores peptide and protein identification data. This node converts the mzIdentML file produced by PeptideShaker into the mz-SQLite format, which is needed to interrogate peptide spectra using the Multi-omics Visualization Platform, or MVP. As shown here, MVP includes a spectrum viewer known as Lorikeet, in which the quality of the spectra can be manually ascertained by the user; this is important for generating figures and for confirming that the peptides are, indeed, real. The final node in this workflow is necessary for the novel peptide analysis workflow, as presented by Subina Mehta in her tutorial. In the third proteogenomics workflow, used for novel peptide analysis, there is an initial BLAST-P step that further analyzes the peptides identified in this workflow.
To do this, the data output from the Query Tabular filtering steps must be converted to the FASTA format, which is done at this node. For this to work properly, it is worth checking before you run the workflow that the title column is set to column 1 of the previous output, and the sequence column is set to column 2. Now that we've discussed the components of this workflow, let's go through how to run it in Galaxy, specifically on the Galaxy Europe instance. Okay, hello again! Here we are on the Galaxy Europe instance, and we're going to go through how to perform the database search for proteogenomics workflow. Starting at the very beginning, let's create a history for this, simply by clicking here, and then we'll give it an appropriate name: call it GTN Proteogenomics Database Search, and I like to include the date in as unambiguous a format as possible. We're obviously going to need a few things to run this:
we're going to need the appropriate input files as we discussed before, and we're also going to need the workflow itself. To start, let's add in our RNA-seq custom database, as well as our reference annotations. Ideally, you would generate these by running through the tutorial presented by James Johnson, or JJ, but to save time I'm just going to import the results of that workflow here for our own use, like so.
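If you're curious what these FASTA databases look like under the hood, here is a minimal parsing sketch; the accessions and sequences below are made up purely for illustration:

```python
# Minimal sketch of the FASTA structure used by these databases: each entry
# has a unique ">" header line followed by the sequence in one-letter codes.
# Both entries here are invented examples.
fasta_text = """\
>sp|P12345|EXAMPLE_HUMAN Example reference protein
MKWVTFISLLLLFSSAYS
>novel|asmbl_001 Novel proteoform from an RNA-seq assembly
MKWVTFISLMLLFSSAYS
"""

def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    entries = {}
    header = None
    for line in text.splitlines():
        if line.startswith(">"):
            header = line[1:]
            entries[header] = ""
        elif header is not None:
            entries[header] += line.strip()
    return entries

db = parse_fasta(fasta_text)
print(len(db))  # number of database entries
```

The unique header on each entry is what the downstream filtering steps use to tell reference proteins apart from novel proteoforms.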
So that takes care of these two inputs to our workflow, but we still need the raw mass spec data in the MGF format for this experiment. To find that, we're going to go into Shared Data, and then to Data Libraries. Since this is for the Galaxy Training Network, we want GTN material, down to proteomics here, then the proteomics database search data, via this link here.
These are all assorted files that are used for the various proteogenomics workflows, but we're going to focus on just these four fractions here. We want to select all four of these and export them to our history. Now, it's very important that we import them as a single collection, so that they can be run together and give a single output for all four files. So we import them as a collection here, from the current selection, into the history that we just made.
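For reference, each spectrum in an MGF file is a plain-text block delimited by BEGIN IONS and END IONS, with key=value metadata followed by m/z and intensity peak pairs. The values below are made up purely to illustrate the layout:

```
BEGIN IONS
TITLE=example_scan_001
PEPMASS=567.890
CHARGE=2+
RTINSECONDS=1200.5
234.567 1050.0
345.678 2200.0
456.789 800.0
END IONS
```

Each of the four fraction files we just imported contains many thousands of such blocks, one per MS2 spectrum.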
This is fine, here's a list, so we'll go ahead and add that. These are all the files that we want, and we will call the collection something like GTN_MGF. Now we'll go back to our workflow. And behold, we have our dataset here, with all four items in the collection. Next we want to import our workflow so that we can do our database search. That is fairly straightforward: again we go to Shared Data, then Workflows (and of course this applies to any and all shared data and shared workflows), and we want this workflow here, GTN Proteogenomics2 Database Search, so we're going to import it.
We can use it anytime, and we'll click here to start using this workflow. Right, so this is my workflow menu, and here's the imported proteogenomics workflow, so let's open it up and take a look. It's just as I went through in the slides: you've got your three input files here, including the custom database and the reference protein accessions from Proteogenomics Workflow 1, as well as our MGF collection. This is all set up to be run as a collection, all the more reason to import the files as a collection. You can, of course, convert the individual datasets into a collection once they're in your history, but it's just as easy to import them as a collection. Then, to SearchGUI, PeptideShaker as before,
mz to SQLite, to generate our mz-SQLite file for database visualization. Then here are the final nodes: the filtering steps, and then the conversion to FASTA. One thing I'd like to note here for this exercise: this workflow is currently using version 1.1.0 of Tabular-to-FASTA, and we're going to want to update that to the most recent version. So we'll just save our changes. Once you have your workflows, you can of course add tools and take them away, and as you change settings you can save them, or copy them to a new file if you want to preserve the original workflow. So, without further ado, we can go ahead and run this workflow. That is going to pull the datasets in our history and populate them into the relevant fields here. Let's go through the individual settings one by one, or rather the individual tools and nodes, to make sure that everything's correct. Here we've got our custom database made from Proteogenomics Workflow 1, which is where it needs to be; we've got our MGF collection in the right spot; and then our reference protein accessions, so that's all good. Now, each of these here represents an individual tool or node in the
workflow, so we can just click the expand/collapse toggles to look at more specific settings. As I alluded to before, you can modify the digestion enzymes depending on the parameters of your experiment. I know that for this dataset trypsin, with two missed cleavages, is appropriate, and those are arguably the most common conditions for bottom-up proteomics, so you probably won't even have to change that. One thing I think is worth pointing out here is the protein modifications: you've got fixed modifications and you've got variable modifications. Fixed modifications are those post-translational modifications you expect to be there, because more than likely you put them there yourself. In this case, we've selected carbamidomethylation of cysteine. This is an extremely common part of
any bottom-up proteomics workflow: before you digest your proteins, you add a reducing agent and then an alkylating agent to get rid of any disulfide bonds and cap your cysteines to prevent the disulfide bonds from reforming; this aids in digestion. Importantly for this experiment, we've also got these modifications here: iTRAQ 4-plex labeling of lysine, as well as iTRAQ 4-plex labeling of the peptide N-terminus. iTRAQ is what's called an isobaric tag; it's useful for quantitation across different samples that are then combined for analysis. This is something that would have been deliberately done in the sample processing step, so we're going to include it in the fixed modifications.
At the same time, we've got a few variable modifications. These are things that might occur as part of a biological process within the cells, something like a phosphorylation or an acetylation, or, as in this case, chemical reactions that can be considered side reactions, things that just sort of happen. For example, processing a protein can variably introduce a degree of oxidation at methionine, so it's important to include that. Likewise, iTRAQ is generally meant to react with primary amines, such as lysine and the peptide N-terminus, but it can also sometimes react at tyrosine, so that's worth including as well. In addition, we've got PeptideShaker here, where we've got it creating an mzIdentML file, which is important for our workflow, with the default methodology otherwise. Then down here we've got the first of two Query Tabular steps, where we manipulate our data. Essentially it's
going to take the outputs from PeptideShaker and filter them based on the contents of our reference accession numbers. What this SQL code means is: if a peptide is found within the reference proteome accession numbers, it's going to be removed from the PeptideShaker outputs. Then this is a similar Query Tabular step, where any peptides fewer than six amino acids long, or more than 30, are going to be removed, just because those tend not to give as good spectra within the mass spectrometer, so it's easier to remove them; you'll spend less time chasing them in the long run. And finally, here at the bottom, we've got our Tabular-to-FASTA step, which we need for BLAST-P analysis down the road. If you look here, you'll see that there's a title column and a sequence column, just like we talked about in the slide portion of the tutorial; you want to make sure these are populated so that the title column is 1 and the sequence column is 2. That's pretty much everything you need, so we're going to go ahead and queue up this workflow. Those of you familiar with Galaxy will recognize this view as steps are queued and jobs begin to run here.
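While the jobs run, note that the Tabular-to-FASTA step just described is conceptually simple. Here is a hedged sketch, with invented rows, of taking column 1 as the header and column 2 as the sequence:

```python
# Sketch of a tabular-to-FASTA conversion: the title column becomes the
# FASTA header and the sequence column becomes the sequence body.
# The peptide names and sequences below are invented examples.
rows = [
    "novel_pep_1\tSSAYSRGVFRR",
    "novel_pep_2\tLTRPDEGGK",
]

def tabular_to_fasta(lines, title_col=1, seq_col=2):
    """Convert tab-separated lines to FASTA records (1-based column indices)."""
    records = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        records.append(f">{cols[title_col - 1]}\n{cols[seq_col - 1]}")
    return "\n".join(records)

print(tabular_to_fasta(rows))
```

This is why checking that the title column is 1 and the sequence column is 2 matters: swapping them would put peptide sequences in the headers and vice versa.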
Now, I didn't want to make you sit here and wait for this all to be done, so I went ahead and ran this workflow in advance, just to make sure everything worked. So let's take a peek at the finished product: we will go into our history collections and switch to the previous one that was run successfully. Here is our successfully run Database Search workflow. First we have the output of SearchGUI; that is a rather large file that has all the search information in it, and it feeds directly into PeptideShaker, which gives us these outputs here. You've got your mzIdentML file, which will be useful down the road, our parameters for the analysis, as well as our PSM report. Now, this is where things start to get interesting. You've got here all the proteins that were
identified in the system, or rather all the peptides that were identified: here are the peptide sequences and the proteins they come from (you can see that some of them are common to multiple proteins), then a bit more information about the modified sequence, the variable modifications, and the fixed modifications. This continues on to all the other information: charges of the peptides, theoretical mass, precursors, and so on. From there, we get this mz-SQLite file, which is useful for visualizing the identifications; we can do that here, by way of example, by opening up the MVP application. It can take a minute. Right, so this gives us spectra that we can visualize down here, though this is maybe not the best one. Anyway, back to the history, we see our first Query Tabular step here.
Here we remove those peptides that are found in the reference database, and it cuts the list down considerably: we've gone from over 5000 lines to just nine that are unique to our system and not found in either the reference proteome or the contaminants. You can see they're annotated with these more unusual accession numbers. From there, you can go ahead and see
that the ones that are too long or too short are removed, leaving just our simplified ID and the sequence. Then finally, in our last step, we get this FASTA file, where the column on the left is used as the header and the sequence is used as the sequence. So this is a successful invocation of the workflow, and now you've got six sequences of novel peptides that can be used in the subsequent proteogenomics workflow for analyzing novel peptides. That's pretty much it on how to run this; now let's jump back into the slides for a minute. With that, I'd like to remind you that the other members of the GalaxyP team at the University of Minnesota are also giving tutorials today. On the subject of proteogenomics itself, there is an introductory presentation from Dr. Tim Griffin.
In addition, James Johnson and Subina Mehta have created excellent proteogenomics tutorials on the generation of custom databases and novel peptide analysis respectively. Beyond proteogenomics, Dr. Pratik Jagtap has given a talk on metaproteomics applications in Galaxy, which is not to be missed. Supplementary tutorials for proteogenomics can be found at the Galaxy Training Network, at the address shown below. I hope you found this tutorial useful, and hope that you will embark on your own proteogenomic experiments in the future. Thank you for your attention, and enjoy the rest of the Smörgåsbord.