GTN Smörgåsbord - Day 1 - Genome Assembly: Introduction (Slides)
Hello everybody, and welcome to an introduction to genome assembly. My name is Simon Gladman and I'm a bioinformatician from the University of Melbourne in Australia. During this slide presentation I'll be giving you a bit of an overview of what genome assembly is and why we want to do it, and then hopefully preparing you for the upcoming tutorial. Okay, just a couple of requirements before we start. Hopefully you've all done the introduction to Galaxy analyses, or you've used Galaxy before and you understand how to use it. If not, please go and have a look at that video first. And hopefully you understand a little bit about sequence analysis,
mainly around quality control. We're not going to really touch on much quality control here as it's covered in another topic. Quality control is vitally important in genome assembly. Okay so the question we're going to try and answer is: How do we perform a very basic genome assembly from short read data? Okay, so we're going to talk about de novo genome assembly, and I'd like to thank Torsten Seemann, Dieter Bulach, Ira Cooke, for a lot of the slides in this presentation.
Okay, so we'll get started. We'll talk about de novo assembly, which is the process of reconstructing the original DNA sequence from the fragment reads alone. And why do we have fragment reads? Well, the simple reason is that none of our equipment can read a full genome at the moment, and so we can only read small pieces at a time. And so what we do is we break the DNA up into
small fragments, and we send it off to a sequencing lab somewhere. They determine the sequence of some parts of those fragments and then send those results back to us. And what we get is a file full of reads, which are pieces of the DNA that came from the sample that we sent. Now unfortunately, these are usually very, very small pieces compared to the length of the genome. And so what we need to do is fit them together, almost like a jigsaw puzzle, to put them back together. And so we need to find reads that fit together by moving them around and comparing them against one another. There could be some problems though: there could be some missing pieces,
and some other pieces could be dirty, like there might be some errors in them. Or another view of what we're doing is: imagine you have a stack of newspapers, and you put those newspapers through the shredder, and then you give them to a kindergarten class. And you say to the kindergarten class: “Hey, glue the original newspaper back together”. And they try and glue the newspaper back together again. And so the draft sequence is little bits of stuff stuck together, and the closed, original newspaper is what we're trying to get. Okay, it might work better if I give you an example. So here we have a string of characters, we'll call it a small genome.
“Friends, Romans, countrymen, lend me your ears”. Okay, so this is one sentence. So we send our sample containing that string off to a sequencing lab, and they chop it up into little bits, and make lots of copies, and then they read the bits that they can read of each of those fragments, and send the resultant reads back to us. “ds, Romans, count”, “ns, countrymen, le”, “Friends, Rom”, “send me your ears;”, and “crymen, lend me”. Okay, and you'll notice that
there might have been an error when the sequencing center read our sample. So what we need to do now is find all the overlaps. As you can see here, “Friends, Rom” kind of overlaps with this “ds, Rom” here, doesn't it? So we can see this “ds, Rom” overlaps with this “ds, Rom”, and so maybe this read and this read fit together. And so we'll lay it out like this. And then we'll say “ns, countrymen, le”, oh look, there's an “ns, count” here, so maybe they overlap as well. And so on and so on, and we find overlaps that make sense between our
reads. And then we lay it out like this. Okay, so you can sort of see we've laid out all the reads with all of their overlaps, and then the next step is to find what we call a consensus. So that is: we look at the evidence of the reads, and we try and figure out what the original sequence, or sentence, might have been. And so for this position, the first position, we have an ‘F’, and that's the only evidence we have, so maybe our majority consensus is an ‘F’, and so on for the ‘r’, and the ‘i’, and the ‘n’. But here we have ‘d’, and this is a bit stronger: we have two of our reads that suggest that there's a ‘d’ in this position in our original sequence. And so, we can say with a bit more confidence perhaps,
that this is probably going to be a ‘d’. And this one here is probably going to be an ‘s’, and this one's going to be a comma, and then we'll have a space, and so on as we go to here. And then, when we get to here, you see we have three pieces of evidence. We have three reads that kind of overlap here. And two of them are saying there's a ‘t’ here, and one of them is saying a ‘c’. And then there's two ‘t's versus one ‘c’,
so maybe the ‘c’ was a mistake, and so we'll say that in this position we're going to assume that our original sequence had a ‘t’. And exactly the same with this position here: “le”, “lend me”, and “send me your ears”, we're going to say: well, perhaps this ‘s’ is an error, because we have more evidence to say that it was an ‘l’. And so we have a final majority consensus: “Friends, Romans, countrymen, lend me your ears”. And we have reconstructed our original sequence from our reads alone, by doing overlap, layout, consensus. So far so good! However, the awful truth is that one does not simply assemble a genome. And Associate Professor Mihai Pop, who did a lot of research into the area of genome assembly, actually stated that genome assembly is impossible under certain circumstances.
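The consensus step of the sentence example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the layout offsets are supplied by hand (a real assembler would derive them from the overlap step), the trailing semicolon of the last read is left out, and every column is assumed to be covered by at least one read.

```python
from collections import Counter

# Reads from the sentence example, each paired with the column at which it
# was laid out. The offsets are an assumption of this sketch; an assembler
# would work them out from the overlaps instead of being given them.
LAYOUT = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),    # read error: 'c' where the original has 't'
    (29, "send me your ears"),  # read error: 's' where the original has 'l'
]


def majority_consensus(layout):
    """Return the most common character in each column of the layout."""
    columns = {}
    for offset, read in layout:
        for i, char in enumerate(read):
            columns.setdefault(offset + i, Counter())[char] += 1
    # Walk the columns left to right and keep the majority vote in each one.
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


print(majority_consensus(LAYOUT))
```

The two erroneous characters are each outvoted two to one, which is the same reasoning as in the walk-through. With only one read covering a column there is nothing to outvote an error, which is one reason sequencing depth matters.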
And we'll talk about what those circumstances are soon. So why is it so hard? Well, unlike our example that we just talked about, we don't have five reads, we have millions of pieces. And they're much, much shorter than the genome. So if we have a human genome, that has three billion bases in its entirety, and some of the chromosomes are very, very large. And if we've only got short read technologies, the longest ones we can get are about 300 bases. And if we've got long read technologies we can get them out to maybe
70,000 bases, but that's still well short of the length of the original genome. And another problem is that a lot of the sequence is repeated throughout our genomes, and throughout the genomes that we're studying. And so they look very, very similar.
And there's lots of missing pieces. So some of them just can't be sequenced. They're too high in GC content, it could be lots of reasons, and there could be a lot of errors in some of those reads. So it's kind of like doing a jigsaw puzzle where you don't know what the picture is, your dog's got to some of the pieces and chewed some of them so they don't quite fit anymore, and one of your children has drawn over the top of some of them with a texta. And so it's actually quite a difficult problem. But we have a basic recipe that we can follow to help us. So one of the things we will do is find all the overlaps between the reads. We'll see where each read overlaps all
the other ones. That sounds like a lot of work, especially for millions and millions of reads. And then we're going to build a graph, which is a picture of the read connections. So we're going to place a read on the table, and then we're going to place another read on the table, and then we'll draw a connection between it. And then we'll add another read, and draw the connections in, and add another read and draw the connections in, and so we'll slowly slowly build up this giant picture of all the reads, and then all their connections to one another.
And so we call this thing a graph. And then what we need to do is simplify that graph, because it's going to be very, very complicated. We're going to have reads that overlap many, many other reads. And so maybe we can remove redundancy, and we can do a bunch of other things that might help us simplify that graph. Unfortunately though, sequencing errors will show up in this graph
and will mess it up a lot. And then finally we will traverse the graph, we will try to trace a sensible path through the graph to produce a consensus. Now this is not a trivial problem at all, in fact it can be quite difficult to do this but this is what our general assembly recipe is. And in pictorial form you can see here we have our reads, we find out all of the overlaps, and then we lay them down and draw all the connections between them. And then we start to
look for paths that visit each of these things once. So we can go from this read to this read, to this read, to this read, and we'll pull out this green to purple part here. So green, blue, purple. And we'll pull out this consensus sequence here. But then what do we do with these parts? And what do we do with the fact that this one has lots of connections going into it? And this one has lots of connections coming out of it. It's a lot more complicated than we think.
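The overlap, graph and traversal steps of the recipe can be sketched on the earlier sentence example. This is a minimal illustration, not how any particular assembler works: it assumes error-free reads and exact suffix-to-prefix matches, the minimum overlap of 3 characters is an arbitrary choice, and the walk only handles the easy case of a single unbranched path.

```python
# Error-free versions of the reads from the sentence example (an assumption
# of this sketch, so that exact string matching is enough).
READS = [
    "Friends, Rom",
    "ds, Romans, count",
    "ns, countrymen, le",
    "trymen, lend me",
    "lend me your ears",
]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Map each read to {following read: overlap length}."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n:
                    graph[a][b] = n
    return graph


def walk_simple_path(graph):
    """Merge reads along the path, stopping at any branch, dead end or loop."""
    targets = {b for edges in graph.values() for b in edges}
    starts = [read for read in graph if read not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one read with no incoming overlap")
    node = starts[0]
    sequence, visited = node, {node}
    while len(graph[node]) == 1:
        (nxt, n), = graph[node].items()
        if nxt in visited:
            break
        sequence += nxt[n:]  # append only the part that does not overlap
        visited.add(nxt)
        node = nxt
    return sequence


graph = build_overlap_graph(READS)
print(walk_simple_path(graph))
```

The walk gives up as soon as a read has more than one way out, which is exactly the situation that makes realistic graphs hard to traverse.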
And this is what a realistic graph looks like. It's very complicated, and tracing a path around this part, sure, you can get from here, down, and then do I go this way or that way? I don't know. We can chase another thing, but actually it's a lot more complicated, and none of these things... ugh, it's a nightmare. It is genuinely no fun. So what ruins this graph? What makes it so complicated? Why
can't we just jigsaw it back together and make it really simple? Well, read errors. They introduce false edges and nodes, or false connections, in our graph. Non-haploid organisms. Organisms like humans, who are diploid, have this thing called heterozygosity, where one version of a chromosome might be slightly different to the other version of the chromosome. And so these can cause detours in our assembly graph, where we can have two possible paths through our graph. This makes it very very
difficult. How do we, how do we choose which one to go down? And how do we know that one isn't just an error, or it's real? There are so many things we need to think about. And then repeats. If the repeats are longer than the read length it's actually impossible for us to assemble. But if the read length is longer than the biggest repeat, then maybe we've got a chance of assembling it. And repeats cause nodes to be shared, and we get locality confusion. And so we can go into a read,
but then we have no way of knowing which path to take out of one of those reads. Okay, so we're going to talk about repeats, because they're really important in our genome assembly. What is a repeat? Well, a repeat is a segment of DNA which occurs more than once in the genome sequence. And they're very, very common: things like transposons, and satellites, and gene duplications occur regularly in pretty much all genomes. Except for viruses
because they're usually very compact. But pretty much everything else, bacteria, fungi, higher eukaryotes, plants, animals, all have huge numbers of repeats. Okay, how does this affect our assembly? Well, if you imagine that these black lines here are our reads, and we've found overlaps, and we've laid them out, and then we found this nice consensus in this black region. But in this red region here, we know that that red region there is identical to this red region here, right? But we have different sets of reads from it, because when we broke the DNA up
and sent it off to the sequencing center, they read this part, and this part, and this part, and this part, and this part, but to us, and to the computer, they look identical. So this red read here looks identical to a combination of these yellow reads here. And so when we do overlap detection, these red reads get piled up on top of the yellow reads, and so even though they came from completely different parts of the genome originally, they get lumped together. And we can't split them, because, well, this part of this genome here, we know that maybe it's connected to this red section.
But then when we're coming out of it here, we don't know whether to go into the yellow section or the red section. So we can come into it no problem, but we don't know which way to go out of it. But this part here assembled up really nicely, with no repeats in it, and so yeah, sure, we can assemble that part. But with this repeat we're going to have this collapse of the consensus, and we're going to just produce one repeat instead of two. Alright, so the law of repeats is: it is impossible to resolve repeats of length S unless you have reads longer than S. And I've written it here twice
because you need to keep it in the back of your mind. Alright. However, we can do some tricks. I'm going to talk about scaffolding, and what that means. So how do we go beyond contigs? So contigs are contiguous pieces of DNA that we can put back together. So this section here, this nice long section here where there's no repeats, there doesn't seem to be any problems, and so when we extract that out of our assembly, we get this piece here called a contig. Whereas the repeats get collapsed down into another contig. Okay, so we want to go beyond contigs, but their sizes are limited by the length of the repeats in our genome, we can't change that, and the length or the span of the reads. And we can use long read technology to hopefully overcome these repeats, but we can also use tricks with other technology.
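The law of repeats can be made concrete with a toy example. Everything here is invented for illustration: the 22-base "genome", the 7-base repeat `GATTACA`, and the helper `placements`. In a real de novo assembly the genome is not known in advance; the sketch only shows that a read lying entirely inside a repeat carries no information about which copy it came from.

```python
# Hypothetical toy genome: the repeat GATTACA (length S = 7) occurs twice,
# separated and flanked by unique sequence.
REPEAT = "GATTACA"
GENOME = "CC" + REPEAT + "TGTG" + REPEAT + "AA"


def placements(genome, read):
    """Return every position at which the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Reads no longer than the repeat fit equally well in both copies.
print(placements(GENOME, "ATTAC"))      # a read inside the repeat
print(placements(GENOME, REPEAT))       # a read exactly as long as the repeat

# Reads longer than the repeat reach into unique flanking sequence on both
# sides, so each one has a single possible placement.
print(placements(GENOME, "CGATTACAT"))  # spans the first copy
print(placements(GENOME, "GGATTACAA"))  # spans the second copy
```

The short reads report two positions each, while the reads that span a whole copy plus a base of flanking sequence on either side report one. That is the sense in which reads need to be longer than S.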
We can change the type of reads that we produce. So say we have this example fragment, this is a fragment of our genome. So we've sent our sample off to the lab, they've broken it down into small little pieces so it'll fit into their machine, and then normally, well originally, what they would do is they would read one end until the machine stopped working. And then they would call that the read. And so they would only sequence one end of the fragment. However, they got cleverer, and they thought: Well, why don't we sequence both ends of the fragments? And so they sequence the first bit, and then they sequence the last bit. Now we know that this particular read is related to this particular read. All right,
because they're on the same fragment. And we can also measure, roughly, the length of this fragment. And so we kind of know how far apart these reads need to be when we're doing our assembly. So when we do our overlap detection etc etc, what we do is we record
where the pair of these two reads lie in relation to one another in our consensus. And we can say: Are they roughly the right distance apart? Are they too far apart? Are they too close together? And so we can exploit that kind of information in our assembly algorithms. Okay so when we do scaffolding we're doing exactly that. We know that the sequences are related to each other because they came from a single fragment, and we roughly know how long the fragments were. And most of the time the pairs will occur in the same contigs but occasionally the pairs will be on different contigs, and this is evidence that these contigs are linked together. And this is what we're talking about here. So sometimes we have one of the paired reads on this
contig, another one of the paired reads on this contig, that's evidence that these two contigs are linked. And we can also get a direction for these contigs, so we sort of know which way they need to face in relation to the other ones. And then we can say, you know this paired end read matches one here, so we can connect those together. And there's another one that came from a
bigger fragment, and you know they sort of join up together there. So when we lay the contigs out, and try and assemble those using the evidence that we have, we can sort of see that there's a contig here, a contig here, and a contig here, and we kind of don't really know what goes in there, but it's probably going to be some kind of repeated element. Probably. Or simply an area of the genome that we didn't sequence, or couldn't sequence. But this will give us what we call a scaffold, and then we can go back and target these kinds of areas and try and fill them in. Okay, so how do we assess how good our assemblies are? Well, we desire our total length of all of our contigs to be similar to the genome size; we want to assemble as much of our original genome as possible. So if we've got a five megabase genome, say an E. coli or something
and we've got three megabases in contigs then we've done a pretty poor job of assembling. But if we have you know close to five, 4.9, 5.1 megabases of contigs then perhaps we're getting close to having an assembly of most of our genome. We want fewer larger contigs. So we want our contigs to be big right? We don't want
to have lots and lots and lots of little ones because that's not going to tell us anything. We want to have a few large contigs that make up most of our assembly. And we want them to be correct, and we need ways of checking to see if they're correct. Because remember, we don't really know anything about this genome before we do any of this work. So we don't know what it looks like, we don't know what the contigs are meant to look like, we kind of know the length but that's about it. So how do we know they're correct? Well, there's some ways we can sort of figure that out, and to do this we have some things called metrics. There's no real generally useful measure, because you don't really have any prior information. So we don't have a truth set, and we can't say yes, this is 99%
true because we don't really know. But what we can do is we can sort of measure the number of long contigs, or we can sort of look at the total number of bases in contigs and compare it with our genome length. We can calculate this thing called the N50, which is a statistic that gets used a lot in assemblies, and it's basically a measure of how together my assembly is.
So the N50 is the length of that contig from which 50% of the bases are in it and shorter contigs. So imagine we have seven contigs out of our assembly, with lengths 1, 1, 3, 5, 8, 12 and 20. Well, the way you calculate the N50 is you lay them out in order like that and then you sum them all up. So when you add all these numbers together you get 50. So our total number of bases in contigs is 50, and half of that is 25.
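The N50 for the seven example contigs can be checked with a short function. Note one deliberate difference: the walk-through in this talk adds contigs starting from the smallest, whereas this sketch uses the widespread convention of starting from the largest and stopping once half the total is reached. For these seven contigs both approaches give the same answer, though they can differ on other inputs. The function name `n50` is just a label chosen for this sketch.

```python
def n50(contig_lengths):
    """N50 by the common convention: sort contigs from longest to shortest,
    add lengths until at least half the total is reached, and return the
    length of the last contig added."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # integer-safe test for "at least half"
            return length
    raise ValueError("N50 is undefined for an empty assembly")


contigs = [1, 1, 3, 5, 8, 12, 20]
print(sum(contigs))  # total bases in contigs
print(n50(contigs))  # N50 of the example assembly
```

Here the total is 50, the longest contig alone contributes 20 (less than 25), and adding the next one brings the sum to 32, so the function returns 12.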
So now what we want to do is start at the smallest and add them together until we get to 25 or above. And then the length of the contig that is the last one that we add to our sum, that is the N50. And so we go 1 plus 1 is 2, then 5, 10, 18. It's not 25 yet, so we need to add the 12 on and we get 30, which is greater than 25, and so the last one that we added into this sum was the 12, and so the N50 of our assembly is 12. And so basically we're saying that at least 50% of our bases are contained in contigs of length 12 or longer. Okay, so there are two levels of assembly: there's a draft assembly and a closed or finished assembly. The closed or finished assembly
is usually our goal: to have a finished reference sequence for our organism of choice. However, sometimes we get to the draft assembly stage, where we get to the end of the scaffolding step and we've got a number of non-linked scaffolds, we've got gaps and we've got unknown sequences in bits of them, but, yeah, we've got probably 80 percent of the genome sort of put together and laid out in a plan. And this is fairly easy to get to and sometimes that's enough. But closing or finishing the assembly, where we want one sequence for each chromosome instead of a bunch of scaffolds, takes a lot more work because we need to look at each gap individually, and we need to figure out what goes in each of these gaps. Small genomes are becoming much easier to do. So we can do a whole bacterium now, and we can almost guarantee we're going to get a closed genome out of it at the end with long read technologies, using Oxford Nanopore or PacBio or
one of the other long read technologies. Large genomes like the human genome are much more difficult. Even with long read technologies, it's still not easy and it's still very expensive, and still the province of consortiums, for example the human genome consortium. So how do you actually go about doing an assembly? So we have an example: we culture our bacteria, we extract our genomic DNA, we send it off to a sequencing center for, say, Illumina sequencing, and what we get back is 250 base pair paired-end reads. We get back two text files from a little vial that we sent off to our sequencing center. Now what do you do?
Well, we can use some assembly tools. In some of the earlier tutorials that we have on the GTN, we use tools like Velvet, the VelvetOptimiser and SPAdes. I have to point out, please, that Velvet and the VelvetOptimiser are training tools. They are, however, very good for teaching people how to run assembly tools, and what's going on behind the scenes. So if you want to learn about assembly then Velvet's okay, but if
you actually want to do an assembly then use something else. Something like SPAdes or ABySS (Newbler doesn't really exist anymore), SGA, ALLPATHS, SOAP, Canu... there's hundreds of them. The list is endless. We can also assemble more than just genomes! We can assemble other things like
metagenomes, and we can assemble transcriptomes using things like Trinity and Trans-ABySS, and there are many, many others. In fact, if you look up genome assembly on Wikipedia, and then go to the list of tools, you have to scroll through like five or six pages of tools. Right, you'll be doing an exercise, but not using Velvet, a bit later on hopefully, if you do the assembly tutorial. So thank you for listening. So this concludes the
introductory slides for genome assembly. If you would like to learn a little bit more about genome assembly, or would like to have a look at some of the algorithms that the genome assemblers use, if you click up here on the top left of the slide deck it will return you to the topic page on assembly. And you can see here there are a lot of different slide decks in this section, and the ones we did were these ones, but there's ones that go into the details about the De Bruijn graph. This one here, a deeper look into genome assembly algorithms, is actually really good and explains to you exactly what's going on inside the tools if you're interested. All right, thank you very much and goodbye!