GTN Smörgåsbord - Day 5 - Assembly - Unicycler assembly of SARS-CoV-2 genome
Hello everyone, my name is Cristobal Gallardo and I will introduce you to the amazing world of genome assembly. There are two main questions to answer here, which you can also find in the tutorial. The first one is: how can a genome be assembled against a background of contaminating reads from other genomes? And the second one is: how can sequencing data be used to obtain an assembled genome? In this tutorial, we are going to address five main objectives. The first one is to know the basic characteristics of SARS-CoV-2.
The second one is to understand Nanopore and Illumina sequencing technologies. The third objective is to detect and remove human reads from SARS-CoV-2 sequencing reads. The fourth objective is to understand the main de novo genome assembly algorithms. And the last objective is to perform a hybrid de novo genome assembly. This is the workflow that we are going to follow throughout this practical.
At first we'll introduce some basic theory about SARS-CoV-2 and the Nanopore and Illumina technologies. Then I'll show you how to get data from the NCBI SRA database. After that, we'll perform the data quality assessment.
The next step will be to detect and remove human reads from the SARS-CoV-2 sequences, and the last step will be to perform the genome assembly. The SARS-CoV-2 virus is a betacoronavirus which belongs to the family Coronaviridae. Its genome is a positive-sense single-stranded RNA of about 30 kb. It includes 14 functional ORFs which encode around 9860 amino acids. It encodes structural proteins and 23 non-structural ones.
The structural proteins are encoded by the *S*pike, *E*nvelope, *M*embrane and *N*ucleocapsid genes. The NSP genes encode the replicase-transcriptase complex. As we can see, the spike, membrane and envelope proteins are arranged on the surface. The spike proteins are glycoproteins with a broad tropism for the mammalian ACE2 receptor, which is considered the primary host-cell target with which the virus associates. For this practical, we'll use a combination of long and short reads to produce the genome sequence. Short reads will be used to obtain an initial assembly, then long reads will be used to resolve potential ambiguities in the previously assembled genome.
Finally, the low-error-rate short reads will be used to polish the assembly, in order to remove potential errors. The short reads have been generated by Illumina technology, characterised by short read length and a low error rate. On the other hand, the long reads have been generated by Oxford Nanopore technology, characterised by long read length and a high error rate.
The basic Illumina sequencing process involves six main steps. In the first step, after fragmenting the DNA into multiple fragments, adaptors are added. Each nanowell contains oligonucleotides that provide anchoring points for the adaptors. Next, the nucleic acid fragments bend over and attach to the adjacent oligonucleotides.
In the third stage, primers attach to the adaptor binding site, which is recognised by the DNA polymerase. During polymerisation, fluorescently tagged dNTPs are used to synthesise the complementary strand. Each of the four bases emits light at a unique wavelength, which allows you to identify which base is incorporated.
In the fifth stage, after polymerisation, the double-stranded DNA is denatured. This sequential process is repeated over and over again, so that all DNA strands in one area derive from a single source, an approach called clonal amplification. Oxford Nanopore technology works in quite a different way. In this case, single nucleic acid molecules pass through a nanopore and changes in the electrical field are measured.
The magnitude of the current density depends on the composition of the nucleic acid which occupies the nanopore. Finally, this information about the change in the current is used to identify the sequence of nucleic acids. For this tutorial, we are going to use a total of 6 different samples, all of which are publicly accessible through the NCBI platform. Now we can start with our analysis. The first step is to create a new history and give it a proper name.
For example, SARS-CoV-2 Assembly. Now we should create two new datasets listing the accession numbers of the Illumina and Nanopore reads that we are going to use for performing the genome assembly. We shall open the uploader tool, paste the accession numbers, give it a proper name, and set tabular as the type. Then start. Now we shall do exactly the same, but with the Nanopore accession numbers. Just rename it as Nanopore Accessions, and set tabular as the type.
Then start. It’s recommended to tag all datasets, because it will allow us to track the two branches of analysis. Open the dataset, click on the edit dataset icon, and add the nanopore tag.
The same for the Illumina dataset. The next step is to retrieve the datasets from the NCBI database by using the *Faster Download and Extract Reads in FASTQ* tool. This one. Select the input type, list of SRA accessions, and execute. Finally, since the Oxford Nanopore data only contains single-end reads, and the Illumina data only contains paired-end reads, we can remove the empty files from the history panel.
In order to do that, we shall activate the operation on multiple datasets, select the empty files, and delete the datasets. Perfect. And now we can rename the datasets: Nanopore reads, and Illumina reads. Quality control, read trimming and pre-processing are essential steps required to guarantee accurate results from our analysis. Due to the very different nature of Illumina and Nanopore reads, they should be processed using very different tools. The Illumina reads will be processed by using the Fastp tool, and the result will be analysed using the MultiQC tool.
On the other hand, the Nanopore reads will be processed by using the NanoPlot tool. Let's go with the quality assessment. We shall select the Fastp tool. Paired collection. The Illumina reads. Now we will configure the tool to retain reads only if no more than 20% of their bases have a Phred quality score below 20.
And if the length of the reads after trimming is at least 50 bases. And execute. This process can take some time.
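The filtering criteria we just configured can be sketched in a few lines of Python. This is a simplified illustration of the per-read logic, not fastp's actual implementation; the function names are hypothetical, but the thresholds mirror the tool settings (keep a read only if no more than 20% of its bases fall below Q20 and it is at least 50 bases long after trimming):

```python
def decode_qualities(quality_string):
    """FASTQ encodes Phred scores as ASCII characters offset by 33."""
    return [ord(c) - 33 for c in quality_string]

def passes_filter(phred_scores, max_low_qual_fraction=0.20,
                  qual_threshold=20, min_length=50):
    """Keep a read only if it is long enough and no more than 20%
    of its bases are below the Q20 quality threshold."""
    if len(phred_scores) < min_length:
        return False
    low_qual = sum(1 for q in phred_scores if q < qual_threshold)
    return low_qual / len(phred_scores) <= max_low_qual_fraction

print(passes_filter(decode_qualities("I" * 60)))  # 'I' encodes Q40: kept
print(passes_filter(decode_qualities("#" * 60)))  # '#' encodes Q2: discarded
print(passes_filter(decode_qualities("I" * 40)))  # too short: discarded
```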
Meanwhile we can launch the NanoPlot tool. Search for NanoPlot in the search bar. Select NanoPlot. Now we shall select the correct file, Nanopore reads, and in the filtering options we should select the logarithmic scaling of read lengths in plots. And execute. Once the process has finished we can analyse the results using the MultiQC tool.
Look for it in the search bar. Select Fastp as tool. Select the dataset. We can give a name to the report.
And execute. Now we can analyse the results of the quality assessment in the MultiQC report. For example, we can see the percentage of reads that have passed the filter.
We can also see how many reads are considered low quality. It also provides information about the duplication rate, and the insert size, which is around 150 bp.
It shows the sequence quality before and after filtering, as well as the guanine-cytosine (GC) content. We can see clear differences after processing the reads. We can also get some information about the unknown bases. Now we are going to analyse the report generated by the NanoPlot tool.
It provides a lot of information about our reads, such as the mean read length, which is around 345 base pairs, and the mean read quality, which is around 9, much lower than the Illumina quality values. It allows us to characterise our datasets quite well. It also provides different plots which can be used in publications. Since the samples were obtained from human tissues, it is necessary to remove the potential contamination by retaining only those reads that don't map to the human genome. As with quality control, the different characteristics of Illumina and Nanopore reads require us to use different tools for mapping.
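As a side note on those summary numbers: a per-read (or per-dataset) quality average is usually computed by converting each Phred score to an error probability, averaging the probabilities, and converting back, which penalises low-quality stretches more than a plain mean of the scores would. A minimal sketch of that calculation, with hypothetical toy reads (this is a simplification, not NanoPlot's actual code):

```python
import math

def mean_error_quality(phred_scores):
    """Average per-base error probabilities, then convert back to a
    Phred-scaled value (Q = -10 * log10(p))."""
    probs = [10 ** (-q / 10.0) for q in phred_scores]
    return -10 * math.log10(sum(probs) / len(probs))

# hypothetical toy reads, each a list of per-base Phred scores
reads = [[10] * 300, [8] * 400, [20] * 350]

mean_length = sum(len(r) for r in reads) / len(reads)
print(round(mean_length))  # 350

all_scores = [q for r in reads for q in r]
print(round(mean_error_quality(all_scores), 1))
```

Note that a single Q8 stretch drags the error-probability average down much more than a Q20 stretch of the same length pulls it up.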
Thus we will use Bowtie2 for mapping the short reads. This tool is optimised for the read lengths and error rates typical of Illumina sequencers. For the long reads, we will use Minimap2. Let's map the Illumina reads.
We shall look for the Bowtie2 tool. Now we shall select “paired end collection”. Select the output of the Fastp tool.
Now we should select the reference genome: Homo sapiens, hg38, which is the latest revision. We'll save the mapping in the history, and finally execute.
Now we are going to map the long reads, and to do that we are going to use Minimap2. We should select the same reference genome. Single reads. Now we should select the Nanopore dataset, and Oxford Nanopore reads to reference mapping; we can leave the rest of the parameters as default, and execute. Once the sequences have been aligned, we are going to use Samtools stats in order to generate some statistics from the Minimap2 alignment. We shall select the long-read alignment file.
Select separate datasets, and summary numbers. You can leave the rest of the parameters as default, and execute. Now we are going to compare the statistics generated by both alignments. We can open one of the datasets generated by Bowtie2, and one of the datasets generated by Samtools stats. Like this. Now we can compare both.
They provide a lot of information, such as the raw total sequences and the reads mapped. We can see that most of the reads didn't map, which makes sense because most of the reads belong to the viral genome. It also provides additional information, such as the maximum read length. And if we compare those results with the ones from Bowtie2, we can observe a similar trend: most of the reads didn't map. The next step is to remove the human reads from our sequences by keeping only the unmapped reads.
In order to do that, we are going to use the Samtools view tool. We should select the alignment generated by Bowtie2. Then in the "Require these flags" option we shall set "Read is unmapped" and "Mate is unmapped", and we can leave the rest of the parameters as default. And execute.
Now we are going to repeat the process, but using the reads generated by Nanopore. In this case we should select only "Read is unmapped" and leave the rest of the parameters as default, and execute. Before starting the assembly it is necessary to carry out some transformations on our data to adjust its format. First, we'll use the Samtools fastx tool to extract the sequences in FASTQ format. Next we'll merge the dataset lists into single datasets by using the Collapse Collection tool. Finally, we'll generate a subset of the Illumina reads by using the Seqtk_sample tool.
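The flag filtering we just performed relies on the FLAG bitfield defined in the SAM specification. A minimal sketch of the logic in Python (the flag constants are real SAM bits; the example flag values are hypothetical records, and this is an illustration rather than samtools' implementation):

```python
# SAM FLAG bits, as defined in the SAM specification
READ_UNMAPPED = 0x4   # 4: this read is unmapped
MATE_UNMAPPED = 0x8   # 8: the read's mate is unmapped

def keep_illumina(flag):
    """Paired-end filter: require BOTH 'read is unmapped' and
    'mate is unmapped', as set in the Samtools view step above."""
    required = READ_UNMAPPED | MATE_UNMAPPED
    return (flag & required) == required

def keep_nanopore(flag):
    """Single-end filter: only 'read is unmapped' is required."""
    return bool(flag & READ_UNMAPPED)

print(keep_illumina(77))   # paired, read + mate unmapped -> True
print(keep_illumina(99))   # properly paired, both mapped -> False
print(keep_nanopore(4))    # unmapped single-end read -> True
```

Because the virus is not human, most reads fail to map to hg38 and therefore carry these bits, so they are the ones we keep.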
Once we have removed the contamination from our sequences, the next step is to extract the reads in FASTQ format from the BAM files, since this is the format required by the assembly tools. We need to select the Illumina dataset, compressed FASTQ format, and READ1 and READ2 as outputs. And execute. Now we shall do exactly the same, but with the Nanopore dataset. In this case we should select unspecified as the output.
And execute. Now we're going to combine each list collection into a single dataset. In order to perform this operation we're going to use the Collapse Collection tool. Now we should select each of the datasets, and execute. We shall repeat the same process for each one. Execute, and then the last one.
Done. Finally, the last step before assembling our genome is to downsample the Illumina sequences. This step is optional, but if you don't have much time I recommend doing it, because assembly of the full dataset can take around 10 hours. In order to do that, we are going to use the Seqtk_sample tool. We should select the Illumina datasets, and execute. Finally, everything is ready for performing the genome assembly.
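One detail worth noting about downsampling paired-end data: both files of a pair must be subsampled with the same random seed so that mates stay together. A minimal sketch of that idea (hypothetical read identifiers; this illustrates the principle, not seqtk's internal algorithm):

```python
import random

def downsample_pairs(read1_ids, read2_ids, fraction, seed=4):
    """Draw the same random subset of indices from both files of a
    read pair; a fixed seed makes the selection reproducible."""
    assert len(read1_ids) == len(read2_ids)
    rng = random.Random(seed)
    n_keep = int(len(read1_ids) * fraction)
    indices = sorted(rng.sample(range(len(read1_ids)), n_keep))
    return ([read1_ids[i] for i in indices],
            [read2_ids[i] for i in indices])

r1 = [f"read{i}/1" for i in range(100)]
r2 = [f"read{i}/2" for i in range(100)]
sub1, sub2 = downsample_pairs(r1, r2, 0.1)
print(len(sub1))  # 10 read pairs kept
# every surviving forward read still has its mate
print(all(a.split("/")[0] == b.split("/")[0] for a, b in zip(sub1, sub2)))
```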
Genome assembly can be defined as the process whose objective is to reconstruct a genome from the reads obtained by sequencing technologies. In this practical we will perform a de novo genome assembly, which refers to assembling a novel genome for which there is no reference sequence available for alignment. Two common types of de novo assemblers are greedy algorithm assemblers and graph method assemblers. Greedy algorithm assemblers basically find overlaps between reads and then build a consensus sequence from the alignment of the overlapping reads. This approach is relatively efficient, but it does not work well for large datasets.
On the other hand, graph method assemblers represent reads as a set of nodes, and overlaps between these reads are directed edges which connect these nodes to form a complete graph. Graph method assemblers are computationally more expensive, but they perform well on large read datasets. We'll use an assembler based on de Bruijn graphs, which is the graph model used by most assemblers. During the assembly process, reads are broken into small fragments of a specific size, called k-mers, which are used as nodes in the assembly graph. Nodes that overlap are then connected by an edge, which represents the reads.
An ideal assembly corresponds to a path that visits every node exactly once. For performing the assembly, we'll use Unicycler, a software tool designed specifically for hybrid assembly of small genomes. Unicycler employs a multi-step process that utilises a set of software tools. SPAdes is used to assemble the Illumina reads into an assembly graph. On the other hand, the Nanopore reads are assembled by using a combination of Miniasm and Racon. Then both assemblies are merged.
And finally, Pilon is used to fix mis-assemblies. Let's use Unicycler. We shall look for the tool in the tool search bar. We should select the reads and the expected number of linear sequences, which is 1, and execute. It can take some time.
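While Unicycler runs, the k-mer graph idea from the theory slides can be made concrete with a tiny toy example. This sketch uses the node-centric formulation described above (each k-mer is a node, and an edge links two k-mers that overlap by k-1 bases); the read sequences are made up, and real assemblers handle errors, repeats and reverse complements that this toy ignores:

```python
from collections import defaultdict

def kmer_graph(reads, k=4):
    """Break reads into k-mers (nodes) and connect nodes whose
    suffix/prefix overlap by k-1 bases (edges)."""
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
    edges = defaultdict(set)
    for a in nodes:
        for b in nodes:
            if a[1:] == b[:-1]:
                edges[a].add(b)
    return nodes, edges

def walk(nodes, edges, start):
    """Follow unambiguous edges, visiting each node at most once;
    each new node adds its final base to the growing contig."""
    contig, current, seen = start, start, {start}
    while True:
        nxt = [n for n in edges[current] if n not in seen]
        if len(nxt) != 1:   # dead end or ambiguous branch: stop
            return contig
        current = nxt[0]
        seen.add(current)
        contig += current[-1]

# two hypothetical overlapping reads reconstruct one longer sequence
nodes, edges = kmer_graph(["ATGGCGT", "GGCGTGC"], k=4)
print(walk(nodes, edges, "ATGG"))  # ATGGCGTGC
```

The two 7-base reads share the k-mers GGCG and GCGT, so a single path through the graph recovers the 9-base original sequence.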
Once Unicycler has finished, we can use the Bandage tool. It will allow us to explore the assembly graph through summary reports and visualisations of its content. We should select the final assembly graph, and execute. Now we are going to generate an assembly graph visualisation by using the Bandage Image tool. We should select the final assembly, set the core parameters, and execute.
Finally, we are going to have a look at the statistics generated by Bandage Info. It provides a lot of information; probably the most important value is the largest component in base pairs, which is quite similar to the expected size of the virus, around 30 kilobases. And that's all. I hope you enjoyed it!
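For reference, the kind of summary Bandage Info reports (total length, largest component, N50) is straightforward to compute from a list of contig lengths. A minimal sketch, with hypothetical contig lengths chosen to resemble a SARS-CoV-2 assembly (this is an illustration of the metrics, not Bandage's code):

```python
def assembly_stats(contig_lengths):
    """Total length, largest contig, and N50: the length L such that
    contigs of length >= L together cover at least half the assembly."""
    total = sum(contig_lengths)
    lengths = sorted(contig_lengths, reverse=True)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"total_bp": total, "largest_bp": lengths[0], "n50": n50}

# hypothetical contigs: one near-complete viral genome plus two fragments
print(assembly_stats([29870, 540, 310]))
```

Here the largest contig dominates the assembly, so it is also the N50, and its size being close to 30 kb is exactly the sanity check mentioned above.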