Antonio Baeza, Virtual Asilomar 2021
Hello, my name is Antonio Baeza. I'm an associate professor in the Department of Biological Sciences at Clemson University, and in this talk I want to give you some information about testing the utility of low-pass nanopore long-read sequencing for mitophylogenomics and barcoding research, using the Caribbean spiny lobster, Panulirus argus, as a model system.

I want to start by mentioning something that is really not news to you: mitochondrial DNA, either the entire chromosome or small fragments of it, has been a marker of choice during the last 20 years for understanding population genetics, phylogeography, connectivity and demographic history, among many other topics, in marine and terrestrial vertebrates and invertebrates. What I want to focus on during this talk is the methodology, and how to improve the retrieval of entire mitochondrial genomes by profiting from third-generation sequencing.

So, as I was mentioning, short fragments or the entire mitochondrial genome have been a marker of choice during the last 20 years, and usually what you do to retrieve information from these markers is PCR and Sanger sequencing. If you're interested in sequencing one or a few fragments, everything is straightforward and the bench work is not too heavy, but it turns out to be heavy and time consuming if you're interested in sequencing the entire mitochondrial genome of an invertebrate or a vertebrate species. That's where we can profit nowadays from second-generation sequencing technology, short-read sequencing from Illumina for instance. The initial read sequencing error has improved so much that, using a low-pass or low-coverage strategy, we can sequence and assemble entire mitochondrial genomes without spending a considerable amount of money.

The problem with Illumina is that it's very time consuming. You can start a project, and from the initial sampling of a specimen or the purification of DNA
all the way to data analysis, that project might take several months or sometimes half a year. It's very important for me, for answering many applied questions in conservation genomics, to be able to retrieve markers very quickly. Unfortunately, Illumina does not allow us to do that, and that's a reason why several colleagues, during the last three to five years, are increasingly relying on third-generation sequencing technology to sequence complete mitochondrial genomes.

Nowadays we have long-read, third-generation sequencing technologies, such as the Oxford Nanopore MinION or the devices developed by Pacific Biosciences, that allow us to produce an impressive amount of data, in absolute terms, in relatively short periods of time. The problem with third-generation sequencing technology is that the initial sequence error rate is still too high, considerably higher than the error reported for Illumina machines. The other problem, especially with PacBio, is that the device itself is extremely expensive. You need very high-quality DNA for library preparation, and it turns out to be expensive to retrieve that high-quality DNA from samples. Library preparation is also time consuming and, as I was mentioning, the technology is very expensive.

At the other extreme of these third-generation sequencing devices we have the Oxford Nanopore devices, especially the MinION. The device itself is cheap, relatively speaking, and very cheap when compared to the machines developed by PacBio. Nowadays we don't need extremely high-quality DNA to prepare the libraries, the protocols have been well developed for the extraction of DNA (or RNA, if you're interested in sequencing RNA in these devices), and we have the possibility of
sequencing in a device for 24 or 48 hours, so in a relatively short period of time we have a large amount of information that we can then use for assembling mitochondrial genomes. So I see the MinION as a disruptive technology. It has the ability, for instance, to provide access to sequencing technology to low-income or moderate-income countries, so that colleagues in these countries, in Latin America for instance, can use these tools to answer important applied conservation genomics questions locally, without relying on developed countries.

In this talk I have three main goals. The first one is to test if mitochondrial genomes can be sequenced using low-coverage long-read nanopore sequencing data exclusively. I want to highlight that I'm very interested in the low-coverage idea here: we are trying to optimize resources and minimize the cost of sequencing mitochondrial genomes, so we are interested in testing if we can retrieve entire, accurate mitochondrial genomes using low-coverage data from low-pass whole-genome sequencing. I also want to highlight the idea of "exclusively". There are several studies, and they are increasing every month, using long reads from Nanopore and PacBio together with short reads for the assembly of large genomes, including the human genome, bacterial genomes and viral chromosomes. However, to the best of my knowledge, for mitochondrial genomes, especially in the arthropods, a very diverse group of invertebrates, there are no studies so far that have used long reads exclusively to retrieve mitochondrial genomes.

The second goal is to explore the accuracy of these long-read assembled mitochondrial genomes, and I will do that here by benchmarking the long-read assemblies against what we call a gold standard, a reference mitochondrial genome obtained using Illumina short reads from the same individual
using the same DNA used for the assembly of the long-read mitochondrial genomes. And last but not least, I want to test if these long-read assembled mitochondrial genomes, irrespective of their accuracy, which might be high or low, as we will see, are useful for mitophylogenomics and for barcoding research.

To accomplish these goals I will be using the Caribbean spiny lobster, Panulirus argus. Let me give you a little information about this species. This large clawless, or spiny, lobster is the most economically profitable fishery in the Caribbean. The industry is approximately a one-billion-dollar industry, so you can imagine that there are a lot of local communities and people relying on this resource all over the Caribbean. Unfortunately, all the populations of this species are either fully exploited or overexploited, and there are serious issues of mislabeling at multiple steps in the supply chain of this resource. If we are able to tackle that problem, we should be able to improve the management of this fishery towards the goal of sustainability, and also to try to understand the effect of overexploitation on these different populations. So, as you can see, there are many reasons to generate genomic resources for this very important marine species in the Caribbean.

Of course, I want to highlight that this species is also very important from an ecological point of view. It is prey for many predators in coral reef systems and in shallow hard-bottom environments in tropical and subtropical areas of the Caribbean, including the Florida Keys. Animals all the way from sharks, among vertebrates, to octopuses, among invertebrates, prey on small lobsters or on individuals of moderate body size. This species also obtains a relatively large amount of energy from lucinid clams that rely on chemosynthetic activity. The point here is that this lobster is transferring energy between different communities in
shallow environments, between seagrass meadows, where the lucinid clams live, and coral reefs, but it is also diverting energy from chemosynthetic environments to other communities that mostly rely on photosynthetic activity. So again, from an ecological point of view, generating genomic resources is very important in this species too, and here we're focusing on mitochondrial genomes.

So let me tell you first about testing if complete mitochondrial genomes can be sequenced from low-coverage long-read nanopore sequencing data exclusively, and I will also speak about the accuracy of these mitochondrial genomes if we are able to retrieve them. The first thing that we did in my lab was to generate a gold standard. We sequenced the mitochondrial genome of this species using short reads and well-established protocols. We extracted DNA and, using Illumina technology, specifically almost half a billion paired-end short Illumina reads, we were able to assemble the mitochondrial genome of this species. We used several tools, including MITObim, NOVOPlasty and GetOrganelle. NOVOPlasty and GetOrganelle are bioinformatic tools specifically developed to assemble mitochondrial genomes from short-read data sets. The amount of reads that we generated allowed us to assemble a high-quality mitochondrial genome with a coverage between 710x and 830x, depending on the pipeline that we used.

We annotated this circular mitochondrial genome, assembled by NOVOPlasty and by GetOrganelle, with MITOS and MITOS2, and the annotation showed us what we were expecting in these mitochondrial genomes: 13 protein-coding genes, 2 ribosomal RNA genes, 22 transfer RNA genes, and also a relatively long region that is a putative control region. This is what you expect for any arthropod, for any crustacean, and for any spiny lobster in particular, and
the point here is that we are able, as expected, to assemble a high-quality gold standard mitochondrial genome using short reads. If we compare gene synteny in this mitochondrial genome, we discover that it's exactly the same as that reported before for all the other species for which colleagues have assembled mitochondrial genomes. That is another reason indicating that the quality of this mitochondrial genome is high, and that we can use it as a gold standard to compare the quality or accuracy of the long-read mitochondrial genomes.

So let me tell you now about the different strategies that we used in an attempt to assemble mitochondrial genomes using low-coverage nanopore long reads. Using well-established standard protocols, we extracted the DNA, prepared libraries and barcoded them, and we sequenced the library that we were able to produce for the Caribbean spiny lobster, the same individual from which we obtained short reads. We sequenced this library in a single MinION for a total of 48 hours, together with five other libraries. Again, the idea here was to produce, in a relatively short period of time, a data set that we consider a low-pass or low-coverage data set. The question then is: using this data set, can we generate relatively quickly a mitochondrial genome for the same species, for the same individual?

So let me tell you a little about the pipeline. First, we were able to obtain a total of 77,520 reads after running the sequencing for 48 hours. Base calling was conducted with Albacore, for trimming we used Porechop, and for quality control and filtering of these reads we used fastp. Before telling you about the different assembly tools that we used to attempt to retrieve the mitochondrial genome of this species, I want to mention why we used Albacore: the base
calling happened during the year 2018, and at that time the version of Guppy wasn't performing better than Albacore. Because of that we decided to go with Albacore, which at that point in time was the state-of-the-art base caller, or one of the most efficient base callers, that we had access to.

About the assembly strategy: we used three different tools, Flye and Unicycler for de novo assembly and Rebaler for reference-based assembly. In Flye, we polished the assembly with the Flye polisher tool, part of the Flye pipeline, for one, five or ten cycles. The idea here was to test the effect of the number of polishing iterations on the accuracy of the assembled mitochondrial chromosome. For Unicycler we used three different strategies, bold, normal and conservative, which are part of the Unicycler pipeline. And for Rebaler we used three different sequences or species as reference: the short-read assembly that we had just generated for the same species, and also Panulirus cygnus and Panulirus ornatus as the other two reference species. Panulirus ornatus is relatively distantly related to P. argus compared to P. cygnus, so the idea here was to test for the effect of the genetic distance of the reference species on the quality or accuracy of the assembly.

Now, if any of these three tools and strategies were successful in assembling and circularizing the mitochondrial genome of the Caribbean spiny lobster, then we expected to observe, in the assembly graph produced by each one of these tools, a circular molecule of approximately the same size as the short-read assembly, approximately 16,000 base pairs. So we visualized the assembly graphs of Flye, Unicycler and Rebaler using the software Bandage, we identified a circular molecule of the expected size, and then we blasted this contig against the NCBI
non-redundant database to confirm the identity of the circular chromosome. If we were successful, we finished with an extra polishing step using Medaka. Let me explain the idea here. We know that Medaka has been developed for polishing contigs or assemblies when base calling is conducted with Guppy, but this was an experimental part of the protocol: we wanted to know if Medaka helps in increasing accuracy when the base caller is Albacore. The model that we used within Medaka was the one most similar to the Albacore version that we used for base calling. And let me give you some quick good news: Medaka actually does improve the accuracy of assemblies even when we use Albacore for base calling. I will show that shortly.

So, if we were successful in retrieving complete mitochondrial genomes using the strategies that I mentioned, we used different metrics for estimating accuracy. Completeness was one of them: we expected the length of the mitochondrial genome, if it is highly accurate, to be similar or identical to the gold standard mitochondrial genome that we generated in this study. Coverage was another metric. And finally sequence identity: specifically, in this study we used patristic distance as a proxy for sequence identity. We aligned, using the software MUSCLE, each one of the long-read mitochondrial genomes against the short-read gold standard mitochondrial genome, and then we calculated the uncorrected distance. The idea here is that if the distance is zero, then the long-read assembly is identical to the short-read gold standard assembly, and the larger this number, the more dissimilar the long-read assembly is to the short-read assembly.

In this first table we have the metrics that I was mentioning. The first thing that I want to highlight is that, using either Flye,
Unicycler or Rebaler, all the strategies that we used did allow us to retrieve complete circular mitochondrial chromosomes of the species that we're studying. So that is good news. Remember, we're using a low-coverage strategy here, and the retrieval of the chromosome is relatively quick. The length of the long-read assemblies was very similar, regardless of the strategy we used, to the reference or gold standard mitochondrial DNA. Coverage was very similar, varying approximately between 30x and 40x, and p-distance was also very similar. So in general, all the strategies that we used were successful in assembling the mitochondrial genome, and the accuracy looks very similar among the different assemblies. So to the question, can we assemble mitochondrial genomes quickly using long-read nanopore data sets, the answer is yes, we can.

The next point is how accurate these mitochondrial genomes are. I want to highlight that benchmarking of long-read assemblies, in mitochondrial genomes or even in bacterial genomes, is not that frequently reported. What I want to show you in this graph is the number of sequence errors in these long-read mitochondrial assemblies depending on the tools and the different pipelines that we used: with or without Medaka; with Flye, different numbers of polishing cycles; with Unicycler, different strategies; and with Rebaler, different reference sequences. This is the number of sequence errors, and we have classified each error according to error type: more specifically, substitutions, one-base-pair insertions, and short insertions or deletions too, and then we also measured whether the errors were one- to five-base-pair homopolymer insertions or one- to five-base-pair homopolymer deletions.

One piece of information that you can retrieve quickly from this graph becomes obvious when you see the dominance of the green color: regardless of the assembler and the strategy, the dark green
color is dominating in this graph. That means that the most common sequence error in the long-read assemblies was a one-base-pair homopolymer deletion. That is something that has been rarely reported, but other colleagues have reported this to be one of the most common errors when assembling chloroplast genomes with nanopore sequencing. The second most common error was a one-base-pair homopolymer insertion; that's the orange color that we see here. The other types of errors were not that common. So again, the very common errors are homopolymer deletions or insertions.

The other piece of information that you can extract here is obvious when you compare bars side by side: high, low, high, low, high, low, and that pattern repeats. That's the effect of the extra polishing with Medaka that I was mentioning. Even though we are not using the exact proper model for extra polishing, Medaka still improves accuracy when you do base calling with Albacore, and this improvement is particularly obvious for the Flye assemblies. So if you're interested in the future in assembling mitochondrial genomes using nanopore long reads, I will argue Flye is the way to go. It turns out that we also tested Canu with several strategies, and we were not able to retrieve a circular mitochondrial chromosome with it. We don't recommend Canu, and we do know that Canu does have issues retrieving small circular molecules.

So in general, long-read genome accuracy using nanopore low-coverage data sets is very high. However, this accuracy is not a hundred percent, and in some way the ultimate test of long-read assembly accuracy is to annotate each one of these long-read genomes using MITOS and MITOS2, again state-of-the-art bioinformatic tools available online specifically for the annotation of mitochondrial genomes. When
we did that for the long-read assemblies, we obtained some very interesting results. I already showed you the annotation for the gold standard short-read assembly, but for each one of the long-read assemblies, in all of the strategies we used, the number of errors was such that unfortunately the annotation with these tools wasn't totally reliable. All but one or two of the protein-coding genes in each one of these long-read assemblies had at least one internal stop codon, usually many, that interrupted the open reading frame, and that is indicated here by the black color in these genes. So only a few genes were properly called by MITOS. If you visualize this, or if you do the annotation manually, you will see that the gene is there but the open reading frame is disrupted, and as a result the bioinformatic tool is not recognizing the position of that gene in the mitochondrial genome. However, it is there. So, although the assemblies are highly accurate, the errors contained in the long-read assembled mitochondrial genomes in general preclude generating a reliable annotation with MITOS and MITOS2.

That has a very important implication. I will argue that at this point in time, let's say during the year 2018, using the bioinformatic tools in this study, the quality of the genomes assembled using long reads, though very high, is not accurate enough, for instance, to study selective pressures in protein-coding genes. Unfortunately, it is not there yet. I do realize that the initial sequence error of nanopore sequencing technology is improving by the month or by the year, so the hope is that maybe in two or five years the quality of the mitochondrial genomes that we can retrieve using nanopore long reads will be high enough to do selective pressure analyses with the data sets
and with the sequences that we generate using long reads.

So last but not least, irrespective of the problems with accuracy in these mitochondrial genomes, the next question that I was interested in answering was whether these long-read assemblies, which can be retrieved quickly and, hopefully, relatively cheaply, can be used for mitophylogenomics and for barcoding research. Again, these types of studies are really important for understanding connectivity in the Caribbean spiny lobster, for instance, or for detecting mislabeling, and that will help us in managing the fishery more efficiently and towards the goal of sustainability in the future.

This is a first likelihood-based phylogenetic analysis that we performed in the lab, using a total of 42 terminals. This analysis relied only on mitochondrial protein-coding genes. We used a total of 11,187 nucleotide characters, and a little more than half of them were parsimony informative. We used IQ-TREE for phylogenetic inference, and in this phylogenetic tree we can see that all the long-read assemblies, together with the gold standard, cluster into a single phylogenetic clade fully supported by bootstrap analysis, and this clade is sister to another clade comprised of Panulirus japonicus and P. cygnus. That tells us, first, that the long-read assemblies, although they have accuracy problems serious enough to prevent selective pressure analysis, do have high phylogenetic informativeness that will allow us to conduct this type of analysis with this type of data in the future. The other important message that I want to comment on here is that the bootstrap values didn't decrease considerably towards the root of this phylogenetic tree. That indicates that mitophylogenomic data, or mitochondrial protein-coding genes exclusively, can maybe be used to reveal
phylogenetic relationships among closely or distantly related species of lobsters. We definitely know that the mitochondrial chromosome is just a single marker, and we are definitely interested in using nuclear genomes, but my point here is that mitochondrial genomes can provide you with a lot of reliable information to reveal phylogenetic relationships among groups of closely or not so closely related species, in this case in the same genus, family or even infraorder in this crustacean clade.

The second analysis that I want to show you is again a phylogenetic analysis. In this case we retrieved from GenBank the totality of the sequences belonging to the cytochrome oxidase 1 gene available in GenBank toward the end of the year 2002, and we added our sequences, retrieved from the long-read assemblies and from the gold standard, to conduct this phylogenetic analysis. What you can see here, as indicated by the black arrow tip, is that again all these CO1 or cox1 fragments retrieved from the long-read assemblies, plus the gold standard fragment, cluster together into a single well-supported clade that was contained within another clade comprised exclusively of specimens of Panulirus argus, and this clade was clearly differentiated from others comprising sequences belonging to other closely or not so closely related lobster species. So again, the question is: can these barcode fragments retrieved from long-read assemblies be used for barcoding analysis, although we know they are not completely accurate? The answer is yes, we can use them. I want to highlight that the great majority, 99% or more, of the sequences of this cox1 gene available in GenBank were sequenced using PCR and Sanger sequencing, so these are relatively high-quality sequences compared to these long-read sequences,
so a combination of them for this analysis can be permitted in some way. So a main point from these last two analyses is that, although not completely accurate, long-read mitochondrial genomes can reliably identify, in our case, the sequenced specimen as belonging to the species that we're studying, Panulirus argus, and can also differentiate our specimen from other closely and even distantly related species in the same genus, family and infraorder. So again, to the question, can we use long-read assembled mitochondrial genomes for barcoding and mitophylogenomics, the answer is yes.

Before I finish, let me give you three main messages. First, I hope this study turns out to be a proof of concept for the future implementation of in situ surveillance protocols using the MinION to detect mislabeling in the Caribbean spiny lobster across its supply chain. The idea, again, is having markers that are maybe not perfect, but that have high accuracy, are cheap and can be retrieved quickly to survey mislabeling, so as to improve the management of this fishery.

The idea with this study is also to hopefully decrease the cost of exploring metapopulation connectivity in the Caribbean spiny lobster. We have already developed a panel of SNPs based on 2b-RAD sequencing, a study that we published a couple of years ago, and the idea is to use mitochondrial genomes together with the SNPs to understand connectivity among populations in this species, migration, demographic history and, of course, hopefully adaptation, mostly relying on SNPs.

And last but not least, the idea with this study is to hopefully use this sequencing technology, specifically not PacBio but Nanopore, and to transfer it to colleagues in moderate- or low-income countries. The Caribbean spiny lobster is present on all
the Caribbean coasts of Central America, the Caribbean and even North and South America. There are not too many resources in many of these countries, and the idea is for my lab to go and provide this resource and, hopefully together with colleagues in Caribbean and Central American countries, to understand the connectivity of this very important species.

If you are interested in additional details about this study, you are welcome to download the paper, which was published at the end of the year 2020 in BMC Genomics. It is open access, of course, and available to you for free on the internet. I also want to mention that, as we speak, we are assembling the nuclear genome of the Caribbean spiny lobster and finalizing the sampling of this species with colleagues in other countries in the Caribbean and Latin America, so as to understand metapopulation connectivity and local adaptation in this species, again with the goal of improving the fishery of this really important species. If you have any questions, please visit my website at Clemson University or send me an email. Thank you.
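
The sequence-identity metric described in the talk (each long-read assembly aligned against the short-read gold standard, with a distance of zero meaning the two are identical) can be illustrated with a minimal sketch. It computes the proportion of differing sites between two already-aligned sequences. The function name `p_distance`, the `gaps_as_differences` option and the toy sequences are assumptions made for illustration; this is not the code used in the study.

```python
def p_distance(aligned_a: str, aligned_b: str, gaps_as_differences: bool = False) -> float:
    """Proportion of differing sites between two aligned sequences.

    Assumption for this sketch: both strings come from the same pairwise
    alignment (equal length, '-' marks a gap). By default, columns with a
    gap in either sequence are skipped; with gaps_as_differences=True a
    column with a gap in exactly one sequence counts as a difference.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    differences = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # uninformative column
        if a == "-" or b == "-":
            if gaps_as_differences:
                compared += 1
                differences += 1
            continue
        compared += 1
        if a != b:
            differences += 1

    if compared == 0:
        raise ValueError("no comparable sites in the alignment")
    return differences / compared


# Toy example (hypothetical sequences, not data from the study).
gold_standard = "ACGTTTTACG"
long_read_asm = "ACGTTT-ACG"
print(p_distance(gold_standard, long_read_asm))
print(p_distance(gold_standard, long_read_asm, gaps_as_differences=True))
```

Because the dominant errors reported in the talk are single-base homopolymer indels, a distance that ignores gapped columns would not see them, which is why the sketch exposes the gap handling as an explicit choice.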
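
The error classification described in the talk (substitutions, insertions and deletions, with homopolymer indels counted separately and sized by length) can be sketched in the same spirit. The rule used here, that an indel is a homopolymer indel when all of its bases are identical and match at least one flanking base, is an assumption of this sketch, as are the function names and the toy alignments; the study may have defined and counted errors differently.

```python
from collections import Counter


def _nearest_base(seq: str, start: int, step: int) -> str:
    """Nearest non-gap character walking from index `start` in direction `step`."""
    i = start
    while 0 <= i < len(seq):
        if seq[i] != "-":
            return seq[i]
        i += step
    return ""


def classify_errors(ref_aln: str, asm_aln: str) -> Counter:
    """Count differences between an aligned reference and an aligned assembly.

    Keys are (category, length) tuples. Categories: 'substitution',
    'homopolymer_insertion', 'homopolymer_deletion', 'other_insertion',
    'other_deletion'. Insertions and deletions are relative to the reference.
    """
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned sequences must have the same length")
    ref, asm = ref_aln.upper(), asm_aln.upper()
    counts: Counter = Counter()
    i, n = 0, len(ref)
    while i < n:
        r, a = ref[i], asm[i]
        if r == "-" and a == "-":
            i += 1
        elif r != "-" and a != "-":
            if r != a:
                counts[("substitution", 1)] += 1
            i += 1
        else:
            kind = "deletion" if a == "-" else "insertion"
            # `donor` carries the bases of the indel, `gapped` carries the gaps.
            donor, gapped = (ref, asm) if kind == "deletion" else (asm, ref)
            j = i
            while j < n and gapped[j] == "-" and donor[j] != "-":
                j += 1
            bases = donor[i:j]
            flanks = {_nearest_base(donor, i - 1, -1), _nearest_base(donor, j, 1)}
            is_homopolymer = len(set(bases)) == 1 and bases[0] in flanks
            prefix = "homopolymer" if is_homopolymer else "other"
            counts[(f"{prefix}_{kind}", len(bases))] += 1
            i = j
    return counts


# Toy example (hypothetical alignment): one base missing from a run of A's.
print(classify_errors("AAAACGT", "AAA-CGT"))
```

Tallying the returned counts per assembly would give the kind of per-pipeline error breakdown that the bar graph in the talk summarizes.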
2021-01-16 13:09