GTN Smörgåsbord - Day 1 - Genome Assembly Introduction Slides


Hello everybody, and welcome to an introduction to genome assembly. My name is Simon Gladman and I'm a bioinformatician from the University of Melbourne in Australia. During this slide presentation I'll be giving you a bit of an overview of what genome assembly is and why we want to do it, and then hopefully preparing you for the upcoming tutorial. Okay, just a couple of requirements before we start. Hopefully you've all done the introduction to Galaxy analyses, or you've used Galaxy before and you understand how to use it. If not, please go and have a look at that video first. And hopefully you understand a little bit about sequence analysis,

mainly around quality control. We're not going  to really touch on much quality control here   as it's covered in another topic. Quality  control is vitally important in genome assembly.   Okay so the question we're going to try and answer  is: How do we perform a very basic genome assembly   from short read data? Okay, so we're going to talk  about de novo genome assembly, and I'd like to   thank Torsten Seemann, Dieter Bulach, Ira Cooke,  for a lot of the slides in this presentation.

Okay, so we'll get started. We'll talk about de novo assembly, which is the process of reconstructing the original DNA sequence from the fragment reads alone. And why do we have fragment reads? Well, the simple reason is that none of our equipment can read a full genome at the moment, so we can only read small pieces at a time. And so what we do is we break the DNA up into

small fragments, we send it off to a sequencing lab somewhere. They determine the sequence of some parts of those fragments and then send those results back to us. And what we get is a file full of reads, which are pieces of the DNA that came from the sample that we sent. Now unfortunately, these are usually in very, very small pieces compared to the length of the genome. And so what we need to do is fit them together, almost like a jigsaw puzzle, to put them back together. And so we need to find reads that fit together by moving them around and comparing them against one another. There could be some problems though; there could be some missing pieces,

and some other pieces could be dirty, like there might be some errors in them. Or another view of what we're doing is: imagine if you have a stack of newspapers, and then you put those newspapers through the shredder, and then you give them to a kindergarten class. And you say to the kindergarten: “Hey, glue the original newspaper back together”. And they try and glue a draft newspaper back together again. So the draft sequence is little bits of stuff stuck together, and the closed sequence, the original newspaper again, is what we're trying to get. Okay, maybe it might work better if I give you an example. So here we have a string of characters, we'll call it a small genome.

“Friends, Romans, countrymen, lend me  your ears”. Okay, so this is one sentence.   So we send our sample containing that string off  to a sequencing lab, and they chop it up into   little bits, and make lots of copies, and then  they read the bits that they can read of each of   those fragments, and send the resultant reads back  to us. “ds, Romans, count”, “ns, countrymen, le”,   “Friends, Rom”, “send me your ears;”, and  “crymen, lend me”. Okay, and you'll notice that  

there might have been an error when the sequencing center read our sample. So what we need to do now is find all the overlaps. As you can see here, “Friends, Rom” kind of overlaps with this “ds, Rom” here, doesn't it? So we can see this “ds, Rom” overlaps with this “ds, Rom”, and so maybe this read and this read fit together. And so we'll lay it out like this. And then we'll say “ns, countrymen, le”, oh look, there's an “ns, count” here, so maybe they overlap as well. And so on and so on, and we find overlaps that make sense between our

reads. And then we lay it out like this. Okay, so you can sort of see we've laid out all the reads with all of their overlaps, and the next step is to find what we call a consensus. That is: we look at the evidence of the reads, and we try and figure out what the original sequence, or sentence, might have been. And so for this position, the first position, we have an ‘F’, and that's the only evidence we have, so maybe our majority consensus is an ‘F’, and so on for the ‘r’, and the ‘i’, and the ‘n’. But here we have ‘d’, and this is a bit stronger: we have two of our reads that suggest that there's a ‘d’ in this position in our original sequence. And so, we can say with a bit more confidence perhaps,

that this is probably going to be a ‘d’. And  this one here is probably going to be an ‘s’,   and this one's going to be a comma, and then  we'll have a space, and so on as we go to here.   And then, when we get to here, you  see we have three pieces of evidence.   We have three reads that kind of overlap here.  And two of them are saying there's a ‘t’ here,   and one of them is saying a ‘c’. And  then there's two ‘t's versus one ‘c’,  

so maybe the ‘c’ was a mistake, and so we'll assume that our original sequence had a ‘t’ in this position here. And exactly the same with this position here: “le”, “lend me”, and “send me your ears”. We're going to say: well, perhaps this ‘s’ is an error, because we have more evidence to say that it was an ‘l’. And so we have a final majority consensus: “Friends, Romans, countrymen, lend me your ears”. And we have reconstructed our original sequence from our reads alone, by doing overlap, layout, consensus. So far so good! However, the awful truth is that one does not simply assemble a genome. And associate professor Mihai Pop, who did a lot of research into the area of genome assembly, actually stated that genome assembly is impossible under certain circumstances.
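As a rough sketch of the consensus step in the example above, the five reads can be placed at the offsets where they overlap and each column decided by majority vote. The offsets are assumptions read off the layout described in the talk, and `majority_consensus` is an illustrative name, not something taken from a real assembler.

```python
from collections import Counter

# (offset, read) pairs; the offsets are assumed from the layout in the talk.
layout = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),      # 'c' is a read error (should be 't')
    (29, "send me your ears;"),   # 's' is a read error (should be 'l')
]

def majority_consensus(layout):
    """Pick the most common character in each column of the layout."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, char in enumerate(read):
            columns[offset + i][char] += 1
    return "".join(col.most_common(1)[0][0] for col in columns)

consensus = majority_consensus(layout)
print(consensus)
```

The two erroneous characters are each outvoted two to one, so the printed sentence matches the original. Real assemblers have to work out the offsets themselves, which is the hard part.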

And we'll talk about what those circumstances are soon. So why is it so hard? Well, unlike our example that we just talked about, we don't have five reads, we have millions of pieces. And they're much, much shorter than the genome. So if we have a human genome, that has three billion bases in its entirety, and some of the chromosomes are very, very large. And if we've only got short read technologies, the longest ones we can get are about 300 bases. And if we've got long read technologies we can get them out to maybe

70,000 bases, but that's still well short of the length of the original genome. And another problem is that a lot of the sequence is repeated throughout our genomes, and throughout the genomes that we're studying. And so those regions look very, very similar.

And there are lots of missing pieces. Some of them just can't be sequenced: they're too high in GC content, or it could be lots of other reasons. And there could be a lot of errors in some of those reads. So it's kind of like doing a jigsaw puzzle where you don't know what the picture is, your dog's got to some of the pieces and chewed them so they don't quite fit anymore, and one of your children has drawn over the top of some of them with a Texta marker. And so it's actually quite a difficult problem. But we have a basic recipe that we can follow to help us. So one of the things we will do is find all the overlaps between the reads. We'll see where each read overlaps all

the other ones. That sounds like a lot of work,  especially for millions and millions of reads.   And then we're going to build a graph, which is  a picture of the read connections. So we're going   to place a read on the table, and then we're going  to place another read on the table, and then we'll   draw a connection between it. And then we'll add  another read, and draw the connections in, and add   another read and draw the connections in, and so  we'll slowly slowly build up this giant picture   of all the reads, and then all  their connections to one another.  
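A minimal sketch of those first two steps, assuming error-free reads and exact suffix-to-prefix matches: every ordered pair of reads is compared, and an edge is recorded whenever the end of one read matches the start of another. The function names and the `min_len` threshold are hypothetical choices for illustration.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Map each read to the reads it overlaps into, with the overlap length."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            graph[a][b] = n
    return graph

reads = ["Friends, Rom", "ds, Romans, count", "ns, countrymen, le"]
graph = overlap_graph(reads)
for read, edges in graph.items():
    print(repr(read), "->", edges)
```

Comparing every read against every other read is quadratic in the number of reads, which is why this naive approach does not scale to millions of reads without cleverer indexing.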

And so we call this thing a graph. And then what we need to do is simplify that graph, because it's going to be very, very complicated. We're going to have reads that overlap many, many other reads. And so maybe we can remove redundancy, or do a bunch of other things that might help us simplify that graph. Unfortunately though, sequencing errors will show up in this graph

and will mess it up a lot. And then finally we  will traverse the graph, we will try to trace   a sensible path through the graph to produce a  consensus. Now this is not a trivial problem at   all, in fact it can be quite difficult to do this  but this is what our general assembly recipe is. And in pictorial form you can see here we have  our reads, we find out all of the overlaps,   and then we lay them down and draw all the  connections between them. And then we start to  

look for paths that visit each of these things  once. So we can go from this read to this read,   to this read, to this read, and we'll pull out  this green to purple part here. So green, blue,   purple. And we'll pull out this consensus  sequence here. But then what do we do with   these parts? And what do we do with the fact that  this one has lots of connections going into it?   And this one has lots of connections coming out  of it. It's a lot more complicated than we think.
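One way to picture why the green, blue, purple part is easy and the rest is not: a path can only be followed safely while each step has exactly one way out and the next node has exactly one way in. The sketch below assumes a tiny hand-made graph; the node names beyond green, blue and purple are hypothetical, and `unambiguous_path` is an illustrative helper rather than a real assembler routine.

```python
def unambiguous_path(graph, start):
    """Follow edges from start for as long as the next step is unambiguous."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1

    path = [start]
    node = start
    while len(graph[node]) == 1:          # exactly one way out
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in path:
            break                         # several ways in, or a cycle
        path.append(nxt)
        node = nxt
    return path

# Successor lists; every node appears as a key.
graph = {
    "green": ["blue"],
    "blue": ["purple"],
    "purple": ["red", "orange"],   # branch: two ways out
    "red": ["yellow"],
    "orange": ["yellow"],          # yellow has two ways in
    "yellow": [],
}
path = unambiguous_path(graph, "green")
print(path)
```

The walk stops at purple because it has two outgoing connections, and a walk starting at red cannot continue into yellow because yellow has two incoming connections.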

And this is what a realistic graph looks like. It's very complicated, and tracing a path around this part, sure, you can get from here, down, and then do I go this way or that way? I don't know. We can chase another path, but actually it's a lot more complicated, and none of these things... ugh, it's a nightmare. It is genuinely no fun. So what ruins this graph? What makes it so complicated? Why

can't we just jigsaw it back together and make it really simple? Well, read errors. They introduce false edges and nodes, that is, false connections, in our graph. Then non-haploid organisms. Organisms like humans, who are diploid, have this thing called heterozygosity, where one version of a chromosome might be slightly different to the other version of the chromosome. And so these can cause detours in our assembly graph, where we can have two possible paths through our graph. This makes it very, very

difficult. How do we, how do we choose which one  to go down? And how do we know that one isn't just   an error, or it's real? There are so many things  we need to think about. And then repeats. If the   repeats are longer than the read length it's  actually impossible for us to assemble. But if   the read length is longer than the biggest repeat,  then maybe we've got a chance of assembling it.   And repeats cause nodes to be shared, and we get  locality confusion. And so we can go into a read,  

but then we have no way of knowing which path to take out of one of those reads. Okay, so we're going to talk about repeats, because they're really important in genome assembly. What is a repeat? Well, a repeat is a segment of DNA which occurs more than once in the genome sequence. And they're very, very common: things like transposons, and satellites, and gene duplications occur regularly in pretty much all genomes. Except for viruses

because they're usually very compact. But pretty much everything else, bacteria, fungi, higher eukaryotes, plants, animals, all have huge numbers of repeats. Okay, how does this affect our assembly? Well, if you imagine that these black lines here are our reads, and we've found overlaps, and we've laid them out, then we found this nice consensus in this black region. But in this red region here, we know that that red region there is identical to this red region here, right? But we have different sets of reads from it, because when we broke the DNA up

and sent it off to the sequencing center, they read this part, and this part, and this part, and this part, and this part, but to us, and to the computer, they look identical. So this red read here looks identical to a combination of these yellow reads here. And so when we do overlap detection, these red reads get piled up on top of the yellow reads, and so even though they came from completely different parts of the genome originally, they get lumped together. And we can't split them, because, well, this part of this genome here, we know that maybe it's connected to this red section.

But then when we're coming out of it here, we don't know whether to go into the yellow section or the red section. So we can come into it no problem, but we don't know which way to go out of it. But this part here assembled up really nicely, with no repeats in it, and so yeah, sure, we can assemble that part. But with this repeat we're going to have this collapse of the consensus, and we're going to just produce one repeat instead of two. Alright, so the law of repeats is: it is impossible to resolve repeats of length S unless you have reads longer than S. And I've written it here twice

because you need to keep it  in the back of your mind. Alright. However, we can do some tricks. I'm going  to talk about scaffolding, and what that means.   So how do we go beyond contigs? So contigs are  contiguous pieces of DNA that we can put back   together. So this section here, this nice long  section here where there's no repeats, there   doesn't seem to be any problems, and so when  we extract that out of our assembly, we get   this piece here called a contig. Whereas the  repeats get collapsed down into another contig.   Okay, so we want to go beyond contigs, but their  sizes are limited by the length of the repeats in   our genome, we can't change that, and the length  or the span of the reads. And we can use long read   technology to hopefully overcome these repeats,  but we can also use tricks with other technology.  
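Before moving on to those tricks, the law of repeats stated above can be illustrated with a toy sketch. It assumes two made-up genomes that contain the same 8-base repeat three times but with the unique pieces between the copies swapped, and it assumes an idealised, error-free read set made of every substring of a fixed length. All the sequences and names here are hypothetical.

```python
def all_reads(genome, read_length):
    """Every substring of the given length, sorted: an idealised error-free read set."""
    return sorted(genome[i:i + read_length]
                  for i in range(len(genome) - read_length + 1))

repeat = "ACGTTGCA"  # hypothetical 8-base repeat
genome_1 = "AAAA" + repeat + "CCCC" + repeat + "GGGG" + repeat + "TTTT"
genome_2 = "AAAA" + repeat + "GGGG" + repeat + "CCCC" + repeat + "TTTT"

short_len = len(repeat) + 1  # cannot cover the repeat plus a base on both sides
long_len = len(repeat) + 2   # covers the repeat and one base on either side

same_with_short_reads = all_reads(genome_1, short_len) == all_reads(genome_2, short_len)
same_with_long_reads = all_reads(genome_1, long_len) == all_reads(genome_2, long_len)
print(same_with_short_reads, same_with_long_reads)
```

With the shorter reads the two genomes produce exactly the same reads, so no algorithm could tell them apart. Once a read spans the whole repeat and reaches into the unique sequence on both sides, the read sets differ and the order can be recovered.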

We can change the type of reads that we produce. So say we have this example fragment, this is a fragment of our genome. So we've sent our sample off to the lab, they've broken it down into small pieces so it'll fit into their machine, and then normally, well originally, what they would do is read one end until the machine stopped working. And then they would call that the read. And so they would only sequence one end of the fragment. However, they got cleverer, and they thought: well, why don't we sequence both ends of the fragments? And so they sequence the first bit, and then they sequence the last bit. Now we know that this particular read is related to this particular read. All right,

because they're on the same fragment.  And we can also measure, roughly,   the length of this fragment. And so we kind  of know how far apart these reads need to be   when we're doing our assembly. So when we do our  overlap detection etc etc, what we do is we record  

where the pair of these two reads lie in relation  to one another in our consensus. And we can say:   Are they roughly the right distance apart?  Are they too far apart? Are they too close   together? And so we can exploit that kind  of information in our assembly algorithms. Okay so when we do scaffolding  we're doing exactly that. We know   that the sequences are related to each other  because they came from a single fragment,   and we roughly know how long the fragments were.   And most of the time the pairs will occur in the  same contigs but occasionally the pairs will be   on different contigs, and this is evidence  that these contigs are linked together. And this is what we're talking about here. So  sometimes we have one of the paired reads on this  

contig, another one of the paired reads on this  contig, that's evidence that these two contigs   are linked. And we can also get a direction  for these contigs, so we sort of know which   way they need to face in relation to the other  ones. And then we can say, you know this paired   end read matches one here, so we can connect those  together. And there's another one that came from a  

bigger fragment, and you know they sort of join up together there. So when we lay the contigs out, and try and assemble those using the evidence that we have, we can sort of see that there's a contig here, a contig here, and a contig here, and we kind of don't really know what goes in there, but it's probably going to be some kind of repeated element. Probably. Or simply an area of the genome that we didn't sequence, or couldn't sequence. But this will give us what we call a scaffold, and then we can go back and target these kinds of areas and try and fill them in. Okay, so how do we assess how good our assemblies are? Well, we want the total length of all of our contigs to be similar to the genome size: we want to assemble as much of our original genome as possible. So if we've got a five megabase genome, say an E. coli or something

and we've got three megabases in contigs then  we've done a pretty poor job of assembling.   But if we have you know close to five,  4.9, 5.1 megabases of contigs then   perhaps we're getting close to having  an assembly of most of our genome.   We want fewer larger contigs. So we want  our contigs to be big right? We don't want  

to have lots and lots and lots of little ones because that's not going to tell us anything. We want to have a few large contigs that make up most of our assembly. And we want them to be correct, and we need ways of checking to see if they're correct. Because remember, we don't really know anything about this genome before we do any of this work. So we don't know what it looks like, we don't know what the contigs are meant to look like, we kind of know the length but that's about it. So how do we know they're correct? Well, there are some ways we can sort of figure that out, and to do this we have some things called metrics. There's no real generally useful measure, because you don't really have any prior information. So we don't have a truth set, we can't say yes, this is 99%

true because we don't really know. But what we can do is measure the number of long contigs, or look at the total number of bases in contigs and compare it with our genome length. We can calculate this thing called the N50, which is a statistic that gets used a lot in assemblies, and it's basically a measure of how together my assembly is.

So the N50 is the length of that contig from which 50% of the bases are in it and shorter contigs. So imagine we have seven contigs out of our assembly, with lengths 1, 1, 3, 5, 8, 12 and 20. Well, the way you calculate the N50 is you lay them out in order like that and then you sum them all the way up. So when you add all these numbers together you get 50. So our total number of bases in contigs is 50, and half of that is 25.
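The calculation for these seven contigs can be sketched in a few lines. The talk's walkthrough adds contigs up from the smallest; many tools instead accumulate from the longest contig down, and this sketch assumes that convention. For these seven contigs both orders give the same answer. The function name `n50` is just an illustrative choice.

```python
def n50(lengths):
    """Walk from the longest contig down until at least half of all bases are covered."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

contigs = [1, 1, 3, 5, 8, 12, 20]
total = sum(contigs)    # 50 bases in contigs
result = n50(contigs)   # 20 alone is under 25; 20 + 12 = 32 reaches it
print(total, result)
```

Here `total` is 50 and `result` is 12.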

So now what we want to do is start at the smallest and add them together until we get to 25 or above. And then the length of the contig that is the last one that we add to our sum, that is the N50. And so we go 1 plus 1 is 2, then 5, 10, 18. It's not 25 yet, so we need to add the 12 on, and we get 30, which is greater than 25, and so the last one that we added into this sum was the 12, and so the N50 of our assembly is 12. And so basically we're saying that at least 50% of our bases are contained in contigs of length 12 or bigger. Okay, so there are two levels of assembly: there's a draft assembly and a closed or finished assembly. The closed or finished assembly

is usually our goal: to have a finished reference sequence for our organism of choice. However, sometimes we get to the draft assembly stage, where we get to the end of the scaffolding step and we've got a number of non-linked scaffolds, we've got gaps and we've got unknown sequences in bits of them, but, yeah, we've got probably 80 percent of the genome sort of put together and laid out in a plan. And this is fairly easy to get to, and sometimes that's enough. But closing or finishing the assembly, where we want one sequence for each chromosome instead of a bunch of scaffolds, takes a lot more work, because we need to look at each gap individually and figure out what goes in each of these gaps. Small genomes are becoming much easier to do. So we can do a whole bacterial genome now, and we can almost guarantee we're going to get a closed genome out of it at the end with long read technologies, using Oxford Nanopore or PacBio or

one of the other long read technologies. Large genomes like the human genome are much more difficult. Even with long read technologies, it's still not easy and it's still very expensive, and still the province of consortiums, for example the human genome consortium. So how do you actually go about doing an assembly? So we have an example: we culture our bacteria, we extract our genomic DNA, we send it off to a sequencing center for, say, Illumina sequencing, and what we get back is 250 base pair paired-end reads: we get back two text files from a little vial that we sent off to our sequencing center. Now what do you do?

Well, we can use some assembly tools. In some of the earlier tutorials that we have on the GTN, we use tools like Velvet, the VelvetOptimiser and SPAdes. I have to point out, please, that Velvet and the VelvetOptimiser are training tools only. They are, however, very good for teaching people how to run assembly tools, and what's going on behind the scenes. So if you want to learn about assembly then Velvet's okay, but if

you actually want to do an assembly then use something else. Something like SPAdes or ABySS, Newbler doesn't really exist anymore, SGA, ALLPATHS, SOAP, there's hundreds of them... Canu. The list is endless. And we can assemble more than just genomes! We can assemble other things like

metagenomes, and we can assemble transcriptomes using things like Trinity, Trans-ABySS, and there are many, many others. In fact, if you look up genome assembly on Wikipedia, and then go to the list of tools, you have to scroll through like five or six pages of tools. Right, you'll be doing an exercise, but not using Velvet, a bit later on hopefully, if you do the assembly tutorial. So thank you for listening. So this concludes the

introductory slides for genome assembly. If you would like to learn a little bit more about genome assembly, or would like to have a look at some of the algorithms that the genome assemblers use, if you click up here on the top left of the slide deck it will return you to the topic page on assembly. And you can see here there are a lot of different slide decks in this section, and the ones we did were these ones, but there are ones that go into the details about the De Bruijn graph. This one here, a deeper look into genome assembly algorithms, is actually really good and explains to you exactly what's going on inside the tools if you're interested. All right, thank you very much and goodbye!

2021-03-26
