GTN Smörgåsbord - Day 4 - Proteogenomics 2 Database Search


Hello! My name is Andrew Rajczewski and I'm a graduate student in the Griffin and Tretyakova labs at the University of Minnesota and a part of the Galaxy Proteomics group. As a part of this year's Galaxy Training Network Smörgåsbord, I'd like to present you with the second of three tutorials in a series on proteogenomics applications in Galaxy. In this tutorial, I will take you through how to search mass spectrometry data against a database to perform proteogenomic analyses. A prominent approach in modern biological research is systems biology, wherein a plurality of biomolecules in a system are measured simultaneously to ascertain the response of that system to various stimuli. This is accomplished

through multi-omics technologies, each of which measures a different class of biological molecule. The contents of a genome, for example, are determined through sequencing in a process known as genomics. Similarly, the degree of genomic methylation and the chromatin architecture are determined through epigenomics. The totality of gene transcription can be examined through mRNA sequencing, a process known as transcriptomics. And finally, the phenotype of a system can be more directly ascertained through omics technologies such as proteomics and metabolomics, which measure the proteins present in a system and the small molecules produced by those proteins, respectively.

Our lab's focus is on proteomics and on the analysis of proteomics data in Galaxy. Modern proteomics approaches utilize mass spectrometry to identify, and at times quantify, the proteins in a biological sample. In a conventional proteomics experiment, proteins are digested enzymatically into shorter constituents called peptides for ease of downstream analysis, a technique called bottom-up proteomics. This results in an exceedingly complex mixture: to analyze all the peptides at the same time would be akin to trying to judge the makeup of an entire crowd all at once. Just as that crowd could be separated using queues and turnstiles, the peptide mixture is separated using liquid chromatography, resulting in only a few peptides entering the mass spectrometer at a time. Within the mass spectrometer, the mass-to-charge ratio of the intact peptides is first measured before the peptides are fragmented into smaller pieces. The mass-to-charge

ratios of these pieces are then measured in a separate event, recorded as an MS2 spectrum. Once MS2 spectra are collected, they can be searched against a reference database using bioinformatics tools to determine the peptides, and therefore the proteins, present in the sample. When a peptide fragments in the mass spectrometer, it breaks up in a predictable way along the peptide backbone, giving a series of ions produced by fragmentation at discrete locations on the backbone. Within the proteomics software, these ion series are treated as a sort of fingerprint and compared against the theoretical peptides in the database, looking for a match that would account for the observed ions. Once a spectrum is annotated with a peptide sequence, it is designated a peptide spectrum match, or PSM, and assigned a score depending on the quality of the match.
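
To make the idea of an ion-series fingerprint concrete, here is a minimal Python sketch that computes the singly charged b- and y-ion m/z values a search engine would predict for a peptide. The peptide sequence is made up for illustration; the residue masses are standard monoisotopic values.

```python
# Illustrative sketch: singly charged b- and y-ion m/z values for one peptide.
# The peptide "PEPTIDER" is made up; residue masses are standard monoisotopic values.

PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # monoisotopic mass of H2O

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def fragment_ions(peptide: str):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:              # b1 .. b(n-1), built from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = WATER
    for m in reversed(masses[1:]):     # y1 .. y(n-1), built from the C-terminus
        running += m
        y_ions.append(running + PROTON)
    return b_ions, y_ions

if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDER")
    print("b ions:", [round(mz, 4) for mz in b])
    print("y ions:", [round(mz, 4) for mz in y])
```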

These PSMs can then be assembled to identify proteins. In this way, thousands of proteins can be identified, and potentially quantified, in a single sample. Ideally, identifying all the peptides in a sample would be as simple as optimizing how the mass spectrometer collects the data. However, it is important to note that some spectra within the data set might originate from an unannotated portion of the proteome. By using only canonical reference proteome databases in bottom-up proteomics, it is theoretically possible that an enormous amount of biological information is lost. This can be corrected for by making custom databases that contain the canonical reference proteins in addition to experiment-specific sequences. For example, any sequence variants that are not present in the reference proteome can be identified using a six-frame translation of the genome, a three-frame translation of cDNA, or a protein database derived from RNA-seq experiments.
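
As a rough illustration of what a six-frame translation involves, here is a small Python sketch that translates a made-up DNA sequence in all six reading frames and prints FASTA-style entries. It assumes Biopython is available; the Galaxy tools used in the first proteogenomics tutorial handle this, plus ORF calling and proper headers, for you.

```python
# Minimal sketch of a six-frame translation, assuming Biopython is installed.
from Bio.Seq import Seq

def six_frame_translation(dna: str):
    """Yield (frame_name, protein) for all six reading frames of a DNA string."""
    seq = Seq(dna)
    for strand_name, strand in (("fwd", seq), ("rev", seq.reverse_complement())):
        for offset in range(3):
            frame = strand[offset:]
            frame = frame[: len(frame) - len(frame) % 3]   # trim to a multiple of 3
            yield f"{strand_name}_frame{offset + 1}", str(frame.translate())

if __name__ == "__main__":
    # Made-up nucleotide sequence, purely for illustration.
    dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
    for name, protein in six_frame_translation(dna):
        print(f">{name}\n{protein}")   # FASTA-style output: header line, then sequence
```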

This leads to the identification of single amino acid substitutions, frameshift mutations, and alternative splice isoforms, also known as proteoforms. The use of transcriptomic and/or genomic data to supplement proteomic analysis and identify novel proteoforms is termed proteogenomics, and this is the approach my colleagues and I hope to communicate to you in these tutorials. Performing proteogenomic analyses, as with most multi-omics analyses, has historically been somewhat taxing, requiring considerable time and computational finesse on the part of the analyst. Fortunately, many of the tools used in proteogenomic analyses have been uploaded into Galaxy, where even the most novice bioinformatician can readily use them through a simple graphical user interface. What's more, the tools can be combined into workflows, allowing analyses to run automatically once the starting data sets are provided, ultimately saving the analyst time.

This graphic represents a hypothetical workflow containing all the requisite steps of a proteogenomic analysis. For this tutorial, I will focus on the section highlighted in red, where mass spectrometry data is searched against custom databases to identify putative novel peptides. The other steps highlighted here are covered in the tutorials given by my colleagues.

The searching of mass spectrometry data against the custom database for proteogenomic analysis has been isolated, and painstakingly simplified, into a compact, straightforward workflow, shown here. For the remainder of this session, I will go through the individual nodes in this workflow so that you may better understand what they do. After that, I will walk you through the process of importing data for this tutorial and running the workflow itself, so that you can practice on your own. To begin, let us discuss the input files needed to run the proteogenomics database search workflow, as this workflow cannot be run without all the requisite inputs. To run this workflow, you will need three inputs, each in the correct format. The first input, which may be one or several files, is the raw mass spectrometry data for the experiment you mean to analyze. It is important that these files be in the Mascot Generic Format, or MGF. If they are not, they can be converted using open-source tools like msconvert, which can be found right in Galaxy.
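
For orientation, MGF is a plain-text format: each MS2 spectrum sits between BEGIN IONS and END IONS lines, with KEY=VALUE metadata followed by m/z and intensity pairs. The minimal reader below is only meant to show that structure; the file name is hypothetical, and in practice the Galaxy tools parse MGF for you.

```python
# A rough sketch of what an MGF file contains: plain text, one block per MS2 spectrum.

def read_mgf(path):
    """Yield one dict per spectrum with its metadata and peak list."""
    spectrum = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line == "BEGIN IONS":
                spectrum = {"params": {}, "peaks": []}
            elif line == "END IONS":
                yield spectrum
                spectrum = None
            elif spectrum is not None and line:
                if "=" in line:                      # e.g. TITLE=..., PEPMASS=..., CHARGE=2+
                    key, value = line.split("=", 1)
                    spectrum["params"][key] = value
                else:                                # peak line: "m/z intensity"
                    mz, intensity = line.split()[:2]
                    spectrum["peaks"].append((float(mz), float(intensity)))

if __name__ == "__main__":
    for spec in read_mgf("fraction1.mgf"):           # hypothetical file name
        print(spec["params"].get("TITLE"), len(spec["peaks"]), "peaks")
```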

The second input needed is a custom database for this experiment, generated from suitable RNA-seq data to reflect the alternative proteoforms not found in the conventional proteome. This database is in the FASTA format, with all the proteins

in the sample expressed in one-letter codes, with a unique identifying header on each. The generation of this database is covered in the tutorial created by my colleague, James Johnson, which I encourage you to watch before this one. The final file needed is a second FASTA database, containing the reference protein accessions for the system you are analyzing, which will be utilized near the end of this workflow. In addition to the conventional proteome,

this database can also contain common protein contaminants, such as keratin from human and animal sources, to avoid misattribution of these contaminants to alternative proteoforms. As with the custom FASTA database, this is generated by the first proteogenomics workflow. Now that we've established the requisite files needed to run this workflow, let's discuss the first node in the workflow, where we use SearchGUI. The SearchGUI engine is arguably the heart of this workflow, as this is the node that searches the raw data against the custom FASTA database. SearchGUI was developed by the Martens group

to effectively perform searches of protein mass spectrometry data against FASTA files. Over the years, many different proteomics search algorithms have been developed, each of which has its own unique advantages and disadvantages. Ideally, multiple of these algorithms would be run; however, running them one after another can require considerable time and computational power. SearchGUI allows multiple search engines to be run at the same time, maximizing

the ability to interrogate the data without the time commitment of sequential searches. In addition to being able to perform multiple searches simultaneously, users can also adjust the settings in SearchGUI to accommodate differences between experiments. Specific digestion options can be chosen, selecting from several different enzymes as well as varying the number of allowed missed cleavages. In addition, the peptide precursor options can be adjusted to account for different mass spectrometers' resolutions and optimize the ability to identify PSMs. Finally, post-translational modifications, such as oxidation, acetylation, or phosphorylation, can be specified in SearchGUI so that chemical modifications of the amino acids are accounted for when searching your data.
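
To illustrate what the digestion settings control, here is a small Python sketch of an in-silico tryptic digest using the classic rule (cleave after K or R, but not before P) and allowing up to two missed cleavages. The protein sequence is made up; search engines build their theoretical peptides from the FASTA database in essentially this way.

```python
# Illustrative in-silico tryptic digest with missed cleavages.
import re

def tryptic_peptides(protein: str, missed_cleavages: int = 2):
    """Return the set of tryptic peptides allowing up to `missed_cleavages`."""
    # Split after K or R when the next residue is not P; drop empty fragments.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = set()
    for i in range(len(pieces)):
        for j in range(i, min(i + missed_cleavages + 1, len(pieces))):
            peptides.add("".join(pieces[i : j + 1]))
    return peptides

if __name__ == "__main__":
    # Made-up protein sequence, purely for illustration.
    print(sorted(tryptic_peptides("MKWVTFISLLLLFSSAYSRGVFRRDTHK", missed_cleavages=2)))
```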

Once PSMs have been identified using SearchGUI, the results go to the next node, PeptideShaker. PeptideShaker is in many ways a companion piece to SearchGUI, having also been developed by the Martens group. The SearchGUI results include all potential PSMs generated from your data, regardless of the quality of the match between spectrum and putative peptide. To account for this, PeptideShaker filters the PSMs to a false discovery rate, or FDR, threshold set by the user, leaving only the highest-quality PSMs. PeptideShaker can also output data files in the form of simple tabular lists, as well as an mzIdentML file, whose use in subsequent analyses we will discuss later.
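
The FDR idea rests on target-decoy searching: decoy (for example, reversed) sequences are searched alongside the real ones, and the fraction of accepted decoy hits estimates the error rate. The Python sketch below shows the concept with made-up scores; PeptideShaker's actual validation and scoring model is considerably more sophisticated.

```python
# Conceptual sketch of target-decoy FDR filtering. PSM scores and decoy flags are made up.

def filter_psms(psms, fdr_threshold=0.01):
    """Keep the highest-scoring PSMs such that decoys/targets stays below the FDR."""
    kept = []
    targets = decoys = 0
    for psm in sorted(psms, key=lambda p: p["score"], reverse=True):
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets > fdr_threshold:
            break                      # estimated FDR exceeded; stop accepting PSMs
        if not psm["is_decoy"]:
            kept.append(psm)
    return kept

if __name__ == "__main__":
    example = [
        {"peptide": "PEPTIDEK", "score": 95.2, "is_decoy": False},
        {"peptide": "EDITPEPK", "score": 40.1, "is_decoy": True},
        {"peptide": "LSAMPLEK", "score": 88.7, "is_decoy": False},
    ]
    print([p["peptide"] for p in filter_psms(example)])
```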

Having identified PSMs in the data, the next part of the workflow involves two steps that filter out peptides belonging to conventional proteoforms, leaving novel peptides for our analysis. While we are able to identify high-confidence PSMs using SearchGUI and PeptideShaker, with proteogenomics we are interested in the novel peptides not found in the normal proteome, which are invisible to conventional bottom-up proteomics approaches. At this point, we utilize two Query Tabular steps. The first removes all peptides corresponding to proteins in the reference proteome, such as normal proteins and contaminants, therefore leaving behind only those novel peptides that are unique to this sample. The second Query Tabular step filters out peptides that are either too long or too short to be seen reliably by the mass spectrometer.
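
Query Tabular works by loading tabular data into an SQLite database and running SQL over it. The sketch below mimics, in spirit, what the two filtering steps do: drop peptides whose accessions appear in the reference database, and keep only sequences of a usable length. Table names, column names, and accessions here are illustrative, not the exact ones used in the Galaxy workflow.

```python
# Rough equivalent of the two Query Tabular filters, using SQLite in memory.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE psm (sequence TEXT, accession TEXT)")
conn.execute("CREATE TABLE reference (accession TEXT)")

conn.executemany("INSERT INTO psm VALUES (?, ?)", [
    ("SAMPLEPEPTIDEK", "sp|P12345|KNOWN_PROT"),      # hypothetical known protein
    ("NOVELPEPTIDER", "chr1_frame2_orf17"),          # hypothetical novel accession
])
conn.executemany("INSERT INTO reference VALUES (?)", [("sp|P12345|KNOWN_PROT",)])

novel = conn.execute("""
    SELECT sequence, accession FROM psm
    WHERE accession NOT IN (SELECT accession FROM reference)
      AND LENGTH(sequence) BETWEEN 6 AND 30
""").fetchall()
print(novel)   # only the peptide with the novel accession survives
```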

This workflow also includes nodes that are necessary for downstream analysis. The first is the node denoted mz to SQLite. One of the outputs of PeptideShaker is an mzIdentML file, which stores peptide and protein identification data. This node converts the mzIdentML file produced by PeptideShaker into the mzSQLite format, which is needed for interrogating peptide spectra using the Multi-omics Visualization Platform, or MVP. Shown here is its embedded spectrum viewer, known as Lorikeet, in which the quality of a spectrum can be manually assessed by the user; this is important for generating figures and for confirming that the peptides are, indeed, real. Like the previous node, the final node in this workflow is necessary for downstream work, in this case the novel peptide analysis workflow presented by Subina Mehta in her tutorial. In the third proteogenomics workflow, used for novel peptide analysis, there is an initial BLAST-P step that is used to further analyze the peptides identified in this workflow.

To do this, the data output from the previous Query Tabular filtering steps must be converted to the FASTA format, which is done at this node. For this to work properly, it is worth checking before you run the workflow that the title column is set to column 1 of the previous output and the sequence column is set to column 2.
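
For reference, the conversion itself is simple; the sketch below shows the idea, with hypothetical file names, of turning a two-column tabular file into a FASTA file whose headers come from column 1 and sequences from column 2.

```python
# Sketch of what the Tabular-to-FASTA step does: column 1 -> FASTA header, column 2 -> sequence.

def tabular_to_fasta(tabular_path, fasta_path, title_col=0, seq_col=1):
    """Convert a tab-separated file into FASTA using the given column indices."""
    with open(tabular_path) as tab, open(fasta_path, "w") as fasta:
        for line in tab:
            fields = line.rstrip("\n").split("\t")
            fasta.write(f">{fields[title_col]}\n{fields[seq_col]}\n")

if __name__ == "__main__":
    tabular_to_fasta("novel_peptides.tabular", "novel_peptides.fasta")  # hypothetical names
```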

Now that we've discussed the components of this workflow, let's go through how to run it in Galaxy, specifically on the Galaxy Europe instance. Okay, hello again! Here we are on the Galaxy Europe instance, and we're going to go through how to perform the proteogenomics database search. Starting at the very beginning, let's create a history for this, simply by clicking here, and then give it an appropriate name: call it GTN Proteogenomics Database Search, and then I like to include the date in as unambiguous a format as possible. We're obviously going to need a few things to run this: the appropriate input files, as discussed before, as well as the workflow itself. To start, let's add in our RNA-seq custom database, as well as our reference annotations. Ideally, you would generate these by running through the tutorial presented by James Johnson, or JJ, but to save time I'm just going to import the results of that workflow here for our own use, like so. And now back to it.

So that takes care of these two aspects of our workflow, but we still need the raw mass spec data in the MGF format for this experiment. To find that, we're going to go into Shared Data and then Data Libraries. Since this is for the Galaxy Training Network, we want GTN material, then down to proteomics, and the proteogenomics database search data, which takes us to this link here.

These are assorted files used for the various proteogenomics workflows, so we're going to focus just on these four fractions here. We want to select all four of them and export them to a history. It's very important that we import them as a single collection, so that they can be run together and give a single output for all four files. So we run them as a collection here, using the current selection, and we import them into the history that we just made.

This is fine, here's a list, so we'll go ahead and add that. These are all the files that we want, and we will call the collection something like GTN_MGF. Now we'll go back to our workflow. And, behold, we have our dataset here, with all four items in the collection. So now we want to import the workflow itself so that we can do our database search. To do that is fairly straightforward: again, we go to Shared Data, and then to Workflows, and of course this applies to any and all shared data and shared workflows. For us, we want this workflow here, GTN Proteogenomics2 Database Search, so we're going to import it.

We can use it anytime, and we'll go here to start using this workflow. Right, so this is my workflow menu, and here's the imported proteogenomics workflow, so let's open it up and take a look. It's just as I went through in the slides: you've got your three input files here, including the custom database and the reference protein accessions from Proteogenomics Workflow 1, as well as our MGF collection. This is all set up to be run as a collection, which is all the more reason to import the files as a collection. You can, of course, convert the individual data sets into a collection once they're in your history, but it's just as easy to import them as a collection. Then it proceeds to SearchGUI and PeptideShaker as before, then to

mz to SQLite, to generate our mzSQLite file for visualization. Then here are the final nodes: the filtering steps, and then the conversion to FASTA. One thing I'd like to note here for this exercise: this is currently using version 1.1.0 of Tabular-to-FASTA. We're going to want to switch that to the most recent

version here. So we'll just save that, save our changes. Once you have your workflows, you can obviously add tools and take them away, and as you change settings you can save them, or copy them to a new workflow if you want to preserve the original. So, without further ado, we can go ahead and run this workflow. That is going to pull the datasets from our history and populate them into the relevant fields here. So, let's go through the individual settings one by one, or rather the individual tools and nodes, to make sure that everything is correct. Here we've got our custom database made by Proteogenomics Workflow 1, which is where it needs to be; we've got our MGF collection here in the right spot; and then our reference protein accessions, so that's all good. Now, each of these here represents an individual tool or node in the

workflow, so we can click on each one to expand or collapse it and look at more specific settings. As I alluded to before, you can modify the digestion enzymes depending on the parameters of your experiment. I know that for this dataset trypsin, with two missed cleavages, is appropriate, and that's arguably the most common set of conditions for bottom-up proteomics, so you probably won't even have to change it. One thing I think is worth pointing out here is the protein modifications: you've got fixed modifications and you've got variable modifications. Fixed modifications are the sorts of modifications you expect to be there because, more than likely, you put them there yourself. In this case, we've selected carbamidomethylation of cysteines. This is an extremely common part of

any bottom-up proteomics workflow: before you digest your proteins, you add a reducing agent and then an alkylating agent to get rid of any disulfide bonds and cap the cysteines so the disulfide bonds cannot reform; this just aids in digestion. Importantly for this experiment, we've also got the modifications iTRAQ 4-plex of lysine, as well as iTRAQ 4-plex labeling of the peptide N-terminus. iTRAQ is what's called an isobaric tag; it's useful for quantitation across different samples that are labeled separately and then pooled together. This is something that would have been deliberately done during sample processing, so we're going to include it in the fixed modifications.

At the same time, we've got a few variable modifications. These are things that might occur as part of a biological process within the cells, something like a phosphorylation or an acetylation, or, as in this case, chemical reactions that can be considered side reactions, things that just sort of happen. For example, processing a protein can variably introduce a degree of oxidation at methionine, so it's important to include that. Likewise, iTRAQ is generally meant to react with primary amines, such as lysine and the peptide N-terminus, but it can also sometimes react at tyrosine, and it's worth including that as well.
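
To show how these settings feed into the search, here is a small Python sketch that adds the fixed modification masses discussed above, carbamidomethylation and the iTRAQ labels, to a made-up peptide's precursor mass. The delta masses are standard monoisotopic values but should be treated as approximate; check Unimod for the authoritative numbers.

```python
# Illustrative precursor mass calculation with the fixed modifications discussed above.

WATER = 18.010565
RESIDUE_MASS = {"A": 71.03711, "C": 103.00919, "K": 128.09496, "P": 97.05276,
                "E": 129.04259, "T": 101.04768, "I": 113.08406, "D": 115.02694}

CARBAMIDOMETHYL = 57.02146     # fixed on every cysteine
ITRAQ_4PLEX = 144.102063       # fixed on every lysine and on the peptide N-terminus
OXIDATION = 15.994915          # variable, e.g. on methionine (not applied here)

def precursor_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of a peptide with the fixed modifications applied."""
    mass = WATER + sum(RESIDUE_MASS[aa] for aa in peptide)
    mass += ITRAQ_4PLEX                                   # N-terminal label
    mass += peptide.count("K") * ITRAQ_4PLEX              # lysine labels
    mass += peptide.count("C") * CARBAMIDOMETHYL          # alkylated cysteines
    return mass

print(round(precursor_mass("ACDEPTIK"), 4))   # made-up peptide, for illustration
```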

In addition, we've got PeptideShaker here, where we have it creating an mzIdentML file, which is important for our workflow, and otherwise just the default settings. Then, down here, we've got the first of two Query Tabular steps, where we manipulate our data. Essentially, it's going to take the outputs from PeptideShaker and then filter them based on the contents of our reference accession numbers. What this SQL means is that if a peptide is found within the reference proteome accession numbers, it's going to be removed from the PeptideShaker output. Then there's this similar Query Tabular step, where any peptides shorter than six amino acids or longer than thirty are removed, just because those tend not to give as good spectra within the mass spectrometer, so it's easier to remove them; you'll spend less time chasing them in the long run. And finally, here at the bottom, we've got our Tabular-to-FASTA step, which we need for the BLAST-P analysis down the road. If you look here, you'll see that there's a title column and a sequence column, just like we talked about in the slide portion of the tutorial; you want to make sure that the title column is set to 1 and the sequence column is set to 2. That's pretty much everything you need, so we're going to go ahead and queue up this workflow. Those of you familiar with Galaxy will recognize this, as steps are queued and jobs begin to run here.

Now, I didn't want to make you sit here and wait for this all to be done, so I went ahead and ran this workflow in advance, just to make sure everything worked. So let's take a peek at the finished product: we'll go into our saved histories and switch to the previous one that was run successfully. Right, so here is our successfully run database search workflow. First we have the output of SearchGUI; that is just a rather large file that has all the search information in it, and it feeds directly into PeptideShaker, which gives us these outputs here. You've got your mzIdentML file, which will be useful down the road, our parameters for the analysis, as well as our PSM report. Now, this is where things start to get interesting. You've got here all the proteins that were

identified in the system, or rather all the peptides that were identified: here are the peptide sequences and the proteins that they come from, and you can see that some of them are common to multiple proteins. Then there's a bit more information about the modified sequence, the variable modifications, and the fixed modifications. This continues on to all the other information: charges of the peptides, theoretical masses, precursors, et cetera. From there, we get this mzSQLite file, which is useful for visualizing the identified peptides and their spectra; we can do that here, just by way of example, by opening up the MVP application. It can take a minute. Right, so this gives us potential spectra that we can visualize down here; this is maybe not the best one. Anyway, back to the history: we see our first Query Tabular step here.

Here we remove those peptides that match proteins found in the reference database, and it cuts the list down from, I think, several hundred or a thousand lines; let me just see. Oh yes, we've gone from over 5000 lines to just nine that are unique to our system and not found in either the reference proteome or the contaminants. You can see here that they're annotated with these more unusual accession numbers. From there, you can go ahead and see

that some of the ones that are too long or too short are removed, and we're left with just the simplified ID and the sequence. Then, finally, in our last step we get this FASTA file, where the column on the left is used as the header and the sequence is used as the sequence. So, this is a successful invocation of this workflow, and now you've got, I think, yes: six sequences of novel peptides that can be used in the subsequent proteogenomics workflow for analyzing novel peptides. Right, that's pretty much it on how to run this; now let's jump back into the slides for a minute. With that, I'd like to remind you that the other members of the GalaxyP team at the University of Minnesota are also giving tutorials today. On the subject of proteogenomics itself, there is an introductory presentation from Dr. Tim Griffin.

In addition, James Johnson and Subina Mehta have created excellent proteogenomics tutorials on the generation of custom databases and novel peptide analysis, respectively. Beyond proteogenomics, Dr. Pratik Jagtap has given a talk on metaproteomics applications in Galaxy, which is not to be missed. Supplementary tutorials for proteogenomics can be found at the Galaxy Training Network, at the address shown below. I hope you found this tutorial useful, and I hope that you will embark on your own proteogenomic experiments in the future. Thank you for your attention, and enjoy the rest of the Smörgåsbord.

2021-03-26
