Lecture 5: Limits of Technologies
This is kind of an introductory slide. You heard a lot from Zach, I think a week or two ago, about microarray technology, and I am sure he gave an extremely inspirational and enthusiastic talk about the possibilities and scope of this technology. Let me give you a little disclaimer. Whenever a new technology appears, microarray technology or the genome project for example, at first there is general excitement and optimism that all problems will be solved within a couple of years. There are lots of reasons for this optimism; one is that you want investors in your company and you want public funding. But within a couple of months or years, realistic expectations start to appear, and then you have to start thinking about the limitations of the actual technology. That is the reason we have a talk on this very topic.

When we talk about limitations of a technology, we have to define what you actually want to do in science. There are many definitions of science, but in a sense you would like to make predictions about some sort of system, and we are going to talk about limitations in those terms: how do they limit your predictive power? There are limitations in how accurate your measurements are, accuracy and noise, and there are limitations in terms of sensitivity: what are you measuring, how complete is your measurement? And, as I will touch on only briefly, even if you measure everything very accurately, there are inherent limitations in your predictive power. You cannot predict everything; think about unpredictability in terms of chaos. Even with very precise measurements, there are systems whose behavior you simply cannot predict in an analytic way.

So, noise first. I would like to define what noise is and what signal is: noise as an inherent feature of complex systems, noise in continuous and discrete measurements, noise as a limitation of the technology. And of course we need to talk about what can be done about noise, which is what statistics was invented for, and I will talk briefly about normalization.

What is noise? There are different definitions; this one I took from Webster's. Look at point (e): an irrelevant or meaningless data output occurring along with desired information. You should be aware that noise is not always a bad thing; sometimes noise turns out to be a very important signal. (What is your background? Just biology? Okay, then this is probably not the best example for you.) Two radio astronomers, almost 50 years ago, were looking for signals, and they kept seeing noise coming from every direction in the universe. That turned out to be the cosmic microwave background radiation, one of the most important discoveries, and they got the Nobel Prize a couple of years later. It started as pure noise: they tried to get rid of it, they could not, and that led to the discovery of the cosmic background radiation. Or, if you think about medicine, consider the way cisplatin was discovered as a chemotherapeutic agent.
The electrodes they were using in those experiments contained platinum, and they saw an effect on the cells. (Come in; I just wanted to close the door because of the noise.) They tried to figure out what was killing the cells, what was slowing down their growth, and they realized it was the platinum in the electrodes. That is how cisplatin was discovered. The point is that noise is not always a bad thing.

What you see as noise or error in biological measurements might be a key component of biological processes. Mutations, of course, are extremely important in evolution. And, anticipating the discrete measurements we will discuss later: when you sequence the human genome, or any genome, you see a lot of apparent noise, the so-called junk DNA. We do not really know what this junk DNA is for. It will bother you very much when you are trying to find genes, exons and introns, or transcription factor binding sites in a genome, but there might be a good reason it is there; it might be determining the spatial arrangement of different genes, or whatnot.

Another type of noise that seems to be very important arises during differentiation. Very often you see asymmetric cell division: RNA or proteins are distributed asymmetrically between the two daughter cells, and that happens more or less by chance. The two daughter cells will go one way or the other depending on how much RNA or protein they got. If you do a single-cell measurement, you can perceive this as a kind of noise.

Stochastic fluctuations may also be very important for the stability of complex physical and chemical systems. I may talk about stochastic genetic networks and robustness much later, in April, when I talk about modeling; let it suffice for now that stochasticity and noise in complex systems can be an important feature for maintaining the stability of the system. You should be aware that genetic networks, and biological systems in general, are stochastic systems: you have only a couple of hundred copies of a given transcription factor per nucleus, sometimes only 50; the intracellular environment is not a free solution; and the reaction kinetics are often slow. What does it mean to have a stochastic system? In a completely deterministic system, from any given gene expression state you can go to exactly one other state. In a stochastic system, from any given state you can go to different states, each with a certain probability. That is what we mean by a stochastic system.

If biology is stochastic, then when you measure gene expression levels, protein levels, or the activity of any biological parameter, you will perceive that stochasticity as noise in your measurement. Is this really relevant to biology? Is stochasticity really present in the system?
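To make the deterministic versus stochastic distinction concrete, here is a minimal sketch, my own toy model rather than anything from the slides: a birth-and-death process for mRNA copy number simulated with the Gillespie algorithm. All rate constants are made up for illustration. Cells with identical parameters and identical starting states still end up with different copy numbers, which a deterministic model would never do.

```python
# Toy model (not from the lecture): cell-to-cell variability from a stochastic
# birth-death process for mRNA copy number, simulated with the Gillespie algorithm.
import numpy as np

def gillespie_mrna(k_on=2.0, k_off=0.1, t_end=50.0, seed=0):
    """Simulate mRNA copy number with synthesis rate k_on and decay rate k_off per copy."""
    rng = np.random.default_rng(seed)
    t, n = 0.0, 0
    while t < t_end:
        rates = np.array([k_on, k_off * n])   # [synthesis, degradation]
        total = rates.sum()
        t += rng.exponential(1.0 / total)     # waiting time to the next reaction
        if rng.random() < rates[0] / total:
            n += 1                            # one mRNA made
        else:
            n -= 1                            # one mRNA degraded
    return n

cells = [gillespie_mrna(seed=s) for s in range(10)]
print("final mRNA copies per cell:", cells)      # identical parameters, different outcomes
print("deterministic steady state:", 2.0 / 0.1)  # k_on / k_off = 20 copies
```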
A paper that came out almost two years ago addressed exactly this. It was done in bacteria, and recently similar studies have been published in yeast as well. They took two genes in E. coli and tagged them with two different fluorescent protein variants, so they could measure the expression of two proteins, and they set the system up so that both genes were driven by the very same promoter. It was a very carefully designed experiment: if the system were deterministic, the expression ratio of the two proteins would be the same in every single cell. What they found, despite all their efforts to set up the experiment as perfectly as possible, is that the two colors, shown here as red and green (these are false colors, of course; the point is simply that the two signals are emitted at different wavelengths), vary from cell to cell: you have very green cells, very red cells, and some yellow cells as well. Despite the careful setup, individual cells expressed the two proteins at different ratios, and that was due to stochasticity. So stochastic fluctuations do occur in living organisms. People are now trying hard to understand the relevance of this; it seems to matter a great deal, but we are not quite sure what the implications are.

You should also be aware that whenever you do a biological measurement, at least with today's technology, you are measuring population-averaged data, and this again adds to your noise. When you measure gene expression levels or do proteomics, you grind up millions of cells, or tens of thousands of cells, and that gives you a certain level of noise as well. This is true even if single cells are quantified: if the network is stochastic, then even if you could measure individual proteins in single cells, the measurement usually interferes with the cell or kills it, so you do not know how that cell would have progressed. Since you interfere with a stochastic system, you cannot reconstruct what would have happened to it, so you effectively end up with population-average data again.

There is no measurement without noise; every measurement is limited by its accuracy and sensitivity. In fact, most of you would be extremely troubled if you did a microarray measurement, or any chemical or biological measurement, in triplicate and got exactly the same number three times. That would suggest a systematic error somewhere in your photometer or whatever you are using, because continuous data are expected to come with a certain spread. That is fine, and that is why statistics was invented: you assume there is some true value of your measurement, but due to small fluctuations you will see a spread around that true value.
The usual question, and this is what statistics, at least frequentist statistics, is really concerned with, is: did a variable change due to a given treatment, given a spread like that? If you have measurements here, here, and here, and this was your starting point, what is the probability that your parameter really changed its value? To answer this you need many measurements and, just as importantly, a fairly good idea about the nature of the noise. We will not get into this now, but the easiest and most convenient assumption is a normal distribution. It is good to have that, because then you can make very simple calculations about the probability that your mean, your parameter, actually changed.

Statistics was invented a long time ago, partly driven by biological measurements, and in biology it is concerned with several distinct issues. One is: what is the true value of a given parameter, assuming there is one true value? A second, very frequently used kind of analysis that biologists are often not even aware of is essentially Bayesian, and it is actually the most common statistical reasoning in biology; it is the way all science works. I have a certain belief about whether something will happen, for example whether an oncogene will transform a cell. I make a statement, and my job is to convince you and other biologists that it is true. What you usually do is repeat the experiment, and if you see the same phenomenon, you update our common belief that what I said was actually true. So a kind of Bayesian statistics is always there, hidden, in all biological work: we are trying to update each other's belief networks about biology. The third type of analysis is when you do not really believe the measurements: you know there is a systematic error, and you try to correct for it. That is called normalization, and I will talk about it in detail. And there is a fourth issue, which arises when you produce a large number of measurements and look for patterns. Imagine you are looking for gene expression changes that cause cancer. You have two populations of samples, normal and cancer, and you see that certain genes are always downregulated, upregulated, or mutated in cancer. This could happen by chance if you do not have a large number of samples: if a change is simply random, up in some cells and down in others, then with a small number of samples it can happen, with a certain probability, that the mutation is absent in all normal samples and present in all cancer samples purely by chance. So the question you need to ask is: could the pattern that would explain your biology be present by chance? Sometimes you can answer this analytically, which is what combinatorics is for, or you can do some sort of permutation, and as you will see later, that is a pretty nasty problem when you try to apply it to real-life data.
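As a minimal illustration of that first, frequentist question, with made-up numbers rather than data from any slide, a two-sample t-test on triplicates asks exactly this: given the spread, how confident are we that the underlying parameter really changed?

```python
# Minimal sketch with invented triplicates: did the parameter really change,
# given the spread of the measurements?
from scipy import stats

control   = [1.02, 0.95, 1.08]   # hypothetical triplicate measurements
treatment = [1.85, 2.10, 1.95]   # hypothetical triplicates after treatment

t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value says a difference this large is unlikely if the true value
# did not change, assuming roughly normally distributed noise.
```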
Biological measurements are often expensive, and I would like to point something out. When you start to read the literature, and I assume you will, because that is why you are taking this course, you will see lots of Nature, Science, and other high-profile papers in which a single microarray measurement was run on each of a large number of different cancer samples, and then all sorts of conclusions are drawn about which genes are important in cancer and which are not. These measurements are still rather expensive, and the samples are not easy to obtain, but you should be aware that you cannot really do statistics on data like that. Whatever they are doing is not statistics in the frequentist sense; it is Bayesian: I am telling you that I see this change very often, and you either believe it or not. You cannot get hard numbers out of it that you could use for statistical analysis or modeling. Reliable numbers cannot be produced without replicates, which is kind of obvious. So the central problem is that in massively parallel biological measurements, quantitative and qualitative calls are supposed to be made on a large number of heterogeneous variables using only a few replicates. That is what you will see over and over again if you work in large-scale, massively parallel biology, and it is one of the problems the technology and the analysis have to overcome.

So where does the noise come from in a microarray measurement? This is a slide you saw some variation of in Zach's talk: this is how an Affymetrix DNA microarray works. You start with tissue and extract RNA, and then you do a reverse transcriptase (RT) step that translates the RNA back into cDNA; depending on how you do it, you can produce either cDNA or cRNA. During this process fluorescent dyes are incorporated into the polymers, and the products are then hybridized to the specific probes present on the chip. The underlying assumption, or expectation, is that ideally one copy of a given RNA will produce one unit of a specific signal. If this were true, you would have very accurate measurements. Now let's see what actually happens. When cDNA is produced from the RNA, the RT is an enzyme with its own life and its own characteristics. The initiation of the RT step is stochastic: you need a starting primer that the reverse transcriptase extends, and very often the enzyme simply drops off. So when you do a microarray measurement, you usually see a much stronger signal from the 3' end of the gene than from the 5' end, because as the message is reverse transcribed, the RT keeps falling off; the signal is always strongest near where the RT started, which is the 3' end, since it usually starts from the poly-A tail with a poly-T primer as the initiator. (You can use random primers as well, and that is sometimes done, for example for bacteria, but most of the time you start from the poly-A.)
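To see how a constant drop-off rate produces the 3' bias just described, here is a minimal sketch; the per-base drop-off probability is an assumed number, purely for illustration, not a measured property of any enzyme.

```python
# Toy calculation of 3' bias: assume the RT, starting at the 3' end, falls off
# with a constant probability per base. The chance that a position is covered
# by cDNA then decays with its distance from the 3' end.
drop_off_per_base = 0.001          # assumed value, for illustration only

for distance in (100, 500, 1000, 2000, 4000):
    p_covered = (1 - drop_off_per_base) ** distance
    print(f"{distance:5d} bases from 3' end: covered in {p_covered:.1%} of transcripts")
# Probes near the 3' end therefore see a much stronger signal than probes
# near the 5' end of the same message.
```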
The cRNA used for the Affymetrix chip is produced in the presence of fluorescent dyes, and it was hoped that the dye incorporation would be linear, with every message labeled with equal probability. That is not the case: cRNA production is not linear, some messages are transcribed into cRNA with much higher efficiency than others, and the dye incorporation is not linear either. The Affymetrix protocol also involves a fragmentation step in which the cRNA is broken into small pieces, for reasons of chip design, and that fragmentation is not the same for all messages either. And of course there are other problems, like hybridization and cross-hybridization; one could go on and on. These are just a few examples, but the point is that your final signal is the sum of all of the above, and more. This is just to give you a feel for how many individual issues arise in a microarray measurement: the surface chemistry matters, the background subtraction, and so forth.

Let's look at another example, the two-color microarray. The previous one, the Affymetrix chip you heard about last time, runs a single sample per chip. A competing technology, invented around the same time, labels the cDNA of two different samples, mixes them, and measures the ratio between the two for each individual gene. You take equal amounts of labeled cDNA from the two samples, and the hope is that if a certain message is present at the same level in both samples, the two intensities will be equal, so you get a kind of yellow spot; if a gene is overexpressed or underexpressed, you get a stronger red or green color. What you end up with is a ratio, and the problem is that there is no truly blank spot: there is always some background, and since you are measuring a ratio, that non-blank background gives you a pseudo-signal. If a gene is not present at all in one sample and is present in the other, the true ratio would be infinitely high, but you never see that. Because of the background intensity, what you see instead is something like a 100-fold upregulation, which in truth might be a complete absence of that gene in one of the samples. The experimentalist perceives this as compression of the signals: in principle the log-ratios run from minus infinity to plus infinity, but what you actually see is capped at roughly 100-fold up- or down-regulation on either side. On top of this there are many experimental issues that also contribute to the noise.
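Here is a tiny numeric sketch of that compression effect, with made-up intensities and an assumed background level; the numbers are only for illustration.

```python
# Toy example of ratio compression in two-color arrays: a gene absent in the
# green channel should give an "infinite" red/green ratio, but a non-zero
# background caps the measured ratio.
background = 50.0                      # assumed background intensity (arbitrary units)

for red_signal in (100.0, 1_000.0, 5_000.0, 20_000.0):
    green_signal = 0.0                 # gene completely absent in the green sample
    measured = (red_signal + background) / (green_signal + background)
    print(f"red = {red_signal:7.0f}: true ratio infinite, measured {measured:.0f}-fold")
# A gene completely missing from one sample still shows up as a finite
# fold-change, and weakly expressed genes are compressed the most.
```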
So this is how the Affymetrix chip is designed; you have seen this before. Since the probes are very short, 25 base pairs, the way Affymetrix tried to overcome the problem is to design a set of probes along each gene, using some sort of algorithm, in the hope that with many probes per gene, 11 or 16 of them, you can somehow estimate the true gene expression level from the whole set. As you see here, this is the entire gene, and you are essentially tiling it across sufficiently unique sequence regions. The problem, and this comes from a real measurement, is this: these up here are the perfect match probes that are supposed to measure the same gene. You would expect all these intensities to be roughly equal, and for most genes, for most probe sets, that is not the case: you see very bright and very dark probes. There are many reasons, cRNA secondary structure and so forth, but the point is that when you look a little harder at what you are actually getting, you are expected to estimate a single true gene expression level from a set of intensities that can vary by four or five orders of magnitude. That is the reality of these experiments. It is not that easy to design these chips; one can improve a lot, and the gentleman sitting here could tell you many interesting stories about how these things are designed, or not designed, by the manufacturer, but that is a different story. I was just trying to give you a few pieces of data about where noise comes from in massively parallel measurements in real life.

Even with very good quality measurements, there are conceptual issues in this field as well. Let's assume you want to use your numbers to reverse engineer a system, or to do forward modeling, forward simulation of large genetic networks, so you would like very good quality numbers. The problem is that you always measure a very heterogeneous solution, a very heterogeneous population of RNA and proteins. So how are you going to normalize your numbers? How do you express your measurements, even if your technology is very good: per unit RNA, per microgram of RNA? The trouble is that some genes are very highly expressed and the total number of messages per cell is limited, say on the order of a million copies per single cell. If a highly expressed gene is downregulated, then, unless you can normalize to absolute copy numbers, there is unavoidably a relative increase in the level of the other messages, and what you perceive in your measurement is that some other genes are slightly upregulated. So there are conceptual reasons, too, why you will have noise in your measurements, although as I mentioned, the main issue is the actual technology.
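A small worked example of that compositional effect, with invented copy numbers; the genes and counts are hypothetical.

```python
# Toy illustration: if one abundant gene drops and we quantify "per total RNA",
# genes whose absolute copy numbers never changed appear up-regulated.
before = {"abundant_gene": 900_000, "gene_A": 50_000, "gene_B": 50_000}
after  = {"abundant_gene": 400_000, "gene_A": 50_000, "gene_B": 50_000}  # only one gene changed

def per_total(counts):
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

f_before, f_after = per_total(before), per_total(after)
for gene in before:
    print(f"{gene}: apparent change {f_after[gene] / f_before[gene]:.2f}-fold")
# gene_A and gene_B appear 2-fold up although their absolute copies are unchanged,
# and the abundant gene's true ~2.3-fold drop looks like a mere 11% decrease.
```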
Now, what can you do about this? When you have a set of measurements, you want to take a good hard look at your data to see whether there is some systematic error. These are a bunch of real Affymetrix measurements, showing the intensity distribution of all probe sets: gene expression measurements on roughly 10 to 11 thousand different genes, each covered by its own probe set, and this is the distribution you see. Notice that one measurement is a very strong outlier, and a few others as well, while the rest are pretty much the same. Imagine you are running essentially the same sample, say a single cell line treated with different drugs; you would then expect essentially the same distribution for each RNA sample, with a few variations, and yet you have this guy here. What you can assume, and this is what people usually do, and what the Affymetrix algorithm does, is that for some reason the fluorescent dye incorporation was less efficient in this measurement, or the fluorescence reader was miscalibrated, or something else happened: a systematic error occurred. You then assume that the underlying distributions of all these measurements are actually the same, so you can shift the curves: you take the mean or the median of each curve and shift them all to the same mean or median, you simply decide where everything gets shifted to, and on that basis you renormalize all the numbers. When you then look for differentially expressed genes, you work with the renormalized numbers, because if you had not done this, you would conclude that every gene is downregulated, which is obviously false. That is what normalization is about.

Normalization in general means that you do not entirely believe the numbers coming out of your experiment, and you hope, or assume, that you can improve them by positing a systematic error that you can correct. There are two ways of doing this: either you assume that most, or certain, things do not change, or you have an explicit error model. The first, assuming that most things do not change, is what you saw on the previous slide: you say that most of these distributions have to be very similar, and you shift the means or medians of the curves. Sometimes the shape of the curve differs as well: with non-linear dye incorporation you assume not only that the curve is shifted but that its shape is distorted, so you can do some sort of cubic spline fit, or lowess, to adjust the shapes too, so that all the curves end up looking very similar. Whatever remains an outlier after all this is your true outlier, or at least what you perceive as a real outlier. In most cases this makes a lot of sense, and it provides differential calls that can be corroborated by independent measurements.

There is a similar problem for cDNA microarray measurements. In this case the red versus green ratios are not expected to show any intensity dependence, but in most two-color microarrays this is what you see: these are the intensities and these are the ratios, and instead of a flat trend you see a curve like this. That means the red and the green dyes are not incorporated with the same efficiency, and the bias depends on the concentration of the individual gene species: for low copy number genes, say, the red dye is incorporated with higher efficiency than the green one.
What you do in this case is try to straighten that curve out, because you assume that, averaged over all genes, the red and green incorporation should be the same. So again you are correcting for systematic errors. If your basic assumption is that most things do not change, you can choose a set of elements to base the normalization on. Sometimes that is a set of housekeeping genes, which is a rather shaky concept: you assume that certain genes do not change, say metabolic genes or genes associated with structural proteins. This is done very often, as I am sure you have seen with Northern blots, where people normalize to GAPDH or beta-actin. That is fine; it is just very difficult to find a set of genes that is genuinely not expected to change. Alternatively you can choose a set of special control genes that you know, for some reason, never change in your system. The next step is to determine the normalization function, which can be a global mean or median normalization or some sort of intensity-dependent normalization. If you want to learn more, there is a whole website, a chat room, and what amounts to a cottage industry devoted to figuring out the best way of normalizing microarrays.

The alternative is to come up with an idea of how the error is actually generated. In the most popular error model, it is assumed that at low concentrations you have an additive error, simply white noise around your measurement, that at high concentrations you have a multiplicative error, and that overall the noise is a combination of the two. With these assumptions you can build very good error models, and normalization based on them gives results very similar to the previous approach, so the two methods seem to be more or less interchangeable, at least for cDNA microarrays.
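As a minimal sketch of the "most things do not change" approach, using simulated intensities rather than real chip data, global median normalization looks like this; the lognormal intensities and the factor-of-two defect on one array are invented for the example.

```python
# Toy global median normalization: assume the arrays' intensity distributions
# should look alike, so rescale each array to a common median.
import numpy as np

rng = np.random.default_rng(1)
arrays = rng.lognormal(mean=6.0, sigma=1.0, size=(4, 10_000))   # 4 arrays x 10,000 probes
arrays[2] *= 0.5        # pretend array 2 had a systematic dye/scanner problem

target = np.median(np.median(arrays, axis=1))          # common target median
scale = target / np.median(arrays, axis=1)             # per-array correction factor
normalized = arrays * scale[:, None]

print("medians before:", np.median(arrays, axis=1).round(0))
print("medians after: ", np.median(normalized, axis=1).round(0))
# Whatever still stands out after this shift is treated as real differential
# signal; intensity-dependent corrections (spline, lowess) refine this further.
```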
Noise limits the useful information content of measurements; that is the issue and why you need to be aware of it. Looking across microarray measurements, reliable detection of two-fold differences seems to be the practical limit, and that is an optimistic, within-platform (not cross-platform) assessment. Whether you do a large number of Affymetrix measurements or cDNA microarray measurements on very different cancer samples, a reliably detected two-fold difference is pretty much the limit of the useful information you can extract. Certain genes may be measured more accurately, but across all genes this is probably the experimental limitation.

Why is this an important issue? Getting back to prediction: suppose you want to figure out who is regulating whom, starting from time series measurements, so you measure gene expression or protein changes within a certain time frame. How would you design the experiment? Such experiments have been done on the cell cycle of yeast and of human fibroblasts, but you have to choose your time window correctly. If this solid line is the error of your measurement, there is no point in taking measurements more often than that error allows. If you know how fast genes change and what your experimental error is, you can determine a sensible, reliable time window. It appears, for example, that in yeast there is no point in taking gene expression measurements more often than every five to ten minutes, and in mammalian cells more often than every 15 to 30 minutes. If you measure more frequently, you simply run into noise and waste your money. That is why you need to know the noise limitations: once you know the error of your measurement, you can make back-of-the-envelope calculations of how much information you can actually extract and what it would be enough for.

Moving on to the other issue, sensitivity and completeness. When you try to predict what will happen to your system, there is a trade-off: how many parameters are we measuring, and how many should we measure? If you want to predict whether a certain cancer will metastasize, how many genes do you need? If you want to predict how the cell cycle will progress, how many genes do you need to measure, and how many are we actually measuring? For that we need at least some impression of how large these networks are. This graph, a representation of all interacting proteins in yeast, shows that they are pretty large: about five thousand proteins in this case. Proteins, genes, and protein modifications are all independently regulated, so you can call them something like bionodes, and a cautious estimate is that the number of bionodes per cell is on the order of a couple of hundred thousand. This comes from having perhaps 10 to 20 thousand expressed genes per cell and, say, fewer than 10 post-translational modifications per protein. It could be much more or much less, depending on whether you have to count splice variants or whether it is enough to measure the activity of modules, but a reasonably complete picture is probably on this order.

We certainly do not have that yet, but this is how technology develops, and completeness actually seems to be the easiest thing to achieve: you simply put more and more genes on the chips, especially as the genome projects are completed, and the coverage of microarray chips and proteomics will probably reach the complete genome within the next couple of years. There is no real reason it could not be achieved; all you need is the sequence information and the technology. So completeness can be reached: with tens of thousands of biochemists and biologists working on this, sooner or later you can measure most of the biologically important parameters in the cell, at least in principle, meaning you can have a probe for each of them. But do we actually see signals coming from all of these in microarray measurements?
Michael Holland did these experiments a couple of years ago: he took microarray measurements and RT-PCR measurements on a few hundred genes in yeast. One thing he was interested in was the dynamic range of gene expression in yeast, and what he found is that transcript abundance in yeast spans more than six orders of magnitude. That is a very large range, and you cannot see all of it in any single cell: the lowest numbers mean something like 0.01 copies per cell, so for certain genes, due to stochastic noise, only every hundredth cell will contain even a single copy. That is the dynamic range of gene expression. He was also interested in comparing technologies on these three to four hundred genes, and he chose important genes, mostly transcription factors. RT-PCR is fairly sensitive; at very low concentrations you run into stochasticity, but it is probably the most sensitive technology available. So how sensitive is the microarray relative to RT-PCR? What he saw is that this whole range of low expression levels is simply not seen by the microarray; it is well under the microarray's sensitivity. You only start to see a correlation between the microarray measurement and RT-PCR at around two copies per cell. All these genes are expressed, changing, and probably doing something important, as I said most of them were transcription factors, but the microarray does not see them. So sensitivity is a very important issue: depending on your technology, many genes will fall below its sensitivity. New technologies will improve this, but be aware that even if your microarray shows lots of blank spots, it does not necessarily mean those genes are not changing or not present; your measurement is simply not sensitive enough. The ultimate goal of the technology is, of course, measuring a single copy of a given gene in a single cell.

But even if you measure everything accurately, there can still be problems with prediction, and this is what I was referring to at the beginning. Very quickly: many years ago, I think it actually happened here at MIT, Edward Lorenz was trying to predict how the weather would change. This was in the 1960s. He took a few ordinary differential equations, a completely deterministic system, and tried to predict how the outcome of this set of equations would evolve. He was shocked to see, and a little later the entire scientific community was shocked to see, that these three ordinary differential equations produced behavior extremely sensitive to the initial conditions: change the starting parameters by just a smidgen and the outcome is completely different. This went down in scientific history as chaos theory; you may have heard about bifurcations and so forth. The point is that even if you start with a seemingly completely deterministic system, you may not be able to predict how it will behave, for this very reason: small changes in the initial conditions can cause huge changes at later time points.
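Here is a minimal sketch of that sensitivity, using the standard Lorenz equations and parameters; this is my own toy integration for illustration, not anything from the original work. Two starting points differing by one part in a billion end up in completely different places.

```python
# Toy demonstration of sensitivity to initial conditions in the Lorenz system,
# integrated with a simple Euler scheme and the standard parameters.
import numpy as np

def lorenz_trajectory(x0, steps=50_000, dt=0.001, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = x0
    for _ in range(steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
    return x, y, z

a = lorenz_trajectory((1.0, 1.0, 1.0))
b = lorenz_trajectory((1.0 + 1e-9, 1.0, 1.0))   # a smidgen of difference at the start
print("trajectory a ends at:", np.round(a, 3))
print("trajectory b ends at:", np.round(b, 3))
```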
Now, we know that biology is not quite like that, because biology is a robust system; that is why we can sit here and talk. Many people think biological systems sit somewhere on the edge between completely deterministic and chaotic behavior. But the bottom line is that being able to measure everything very accurately does not necessarily give you very high predictive power.

Let me give you a much simpler example of the same kind of problem. Imagine you have already measured, very accurately and with high sensitivity, the expression of all genes, or many genes, in a variety of cancer samples, and you are trying to figure out which genes cause cancer. Suppose you found a subset of cancer samples, these are actually real measurements from melanoma, say an extremely malignant subset that kills the patient very quickly, and you also think you found a group of genes responsible for that extremely malignant state. You need to ask the question I mentioned before: could this be due to chance? You have a limited number of samples, and just by chance, filling in values randomly, you can sometimes see a pattern like this. Sometimes you can find an analytical solution, but more often you need a computational one: you permute your data set and look for similar patterns, and if you never find a similar pattern, a similar group of genes, in the permuted gene expression matrix, you conclude it is not due to chance. But it is not obvious how to do this properly.

Analytical solutions can sometimes be found, so let me give a very simple example, a little problem you can solve at home. At the dawn of microarray analysis we measured gene expression in different breast cancer cell lines, and because it was very expensive, by the time we reached six cell lines we had found 13 consistently misregulated (up- or down-regulated) genes. The question was: could this be due to chance? It translates into a combinatorics problem: you have N different cell lines and G genes per microarray, n_i is the number of genes misregulated in the i-th cell line, and the question is whether k consistently misregulated genes across all cell lines can be found by chance. If you like combinatorics, this is a nice little home exercise; an analytical solution exists, it can be solved quite easily, and it gives a fairly reliable number.
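One way to set up that calculation, as a sketch under the independence assumption (the notation N, G, n_i, k is mine, as above, and the Poisson approximation is my own shortcut, not necessarily what was on the slide):

```latex
% Probability that one particular gene is misregulated in every cell line, and
% the expected number of such genes, assuming independent random misregulation:
\[
  p \;=\; \prod_{i=1}^{N} \frac{n_i}{G},
  \qquad
  \mathbb{E}[\text{consistent genes by chance}] \;=\; G\,p .
\]
% The count of consistent genes is roughly Binomial(G, p), i.e. Poisson with
% \lambda = G p when p is small, so
\[
  P(\text{at least } k \text{ consistent genes by chance})
  \;\approx\; 1 - \sum_{j=0}^{k-1} e^{-\lambda}\,\frac{\lambda^{j}}{j!},
  \qquad \lambda = G \prod_{i=1}^{N} \frac{n_i}{G}.
\]
% This ignores the up/down direction; requiring a consistent direction of change
% in every cell line makes the chance even smaller.
```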
But what if more genes are involved, and, more importantly, what if genes are not independently regulated? The underlying assumption in the combinatorics is that you draw your samples randomly and independently, but in these data sets genes are co-regulated: if a transcription factor is upregulated, its downstream genes will be upregulated as well, or at least some of them. This comes from real samples: if you do a complete permutation, this is the distribution of correlation coefficients you get for each gene pair, whereas in real samples this is what you see, strongly correlated gene expression changes up and down, which is obvious, because this is a genetically regulated network. The problem is that when you do this analysis and ask whether your pattern could be present by chance, and you use a randomly permuted gene expression matrix as your benchmark, your statistical results can be off by orders of magnitude, six or seven orders of magnitude, relative to an analysis in which you permute the samples but retain the overall dependency structure of gene expression changes. Doing that is not obvious and takes some computational tricks, but it gives you a very different answer.

To spell it out: you found a pattern, say a certain group of genes that appear to cause cancer. What people usually do is a complete randomization: swap everything with everything, look for the same pattern, never find five genes showing it, and declare victory. The problem is that this permutation has completely destroyed the co-dependence of the genes. In reality, genes that are strongly co-regulated are not independent; for the purposes of your analysis you could almost replace them by a single gene, and that dependency is exactly what you should retain. Of course you have neither complete co-regulation nor complete independence; you have a distribution of correlation coefficients, which is what you see here. So the way to do this is to create a large number of random matrices in which the distribution of correlation coefficients looks like the real one but which are otherwise random, and then ask whether your pattern appears in those as well. If you compare the statistical confidence obtained against these two kinds of null matrices, you can be off by five or six orders of magnitude: something that looks significant against one benchmark is far below significance against the other. So the point is that it is not obvious how to do these things; even with good quality measurements, biology presents very difficult problems. The same issue appears in sequence analysis; the whole question of BLAST statistics is about this as well.
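Here is a minimal sketch of the difference between the two null models, on invented data: shuffling every entry of the matrix destroys the gene-gene correlations, while permuting the sample labels (columns) keeps the co-regulation structure intact. The "regulator" and noise level are made up for the example.

```python
# Toy comparison of two permutation null models for a gene expression matrix.
import numpy as np

rng = np.random.default_rng(2)
samples = 40
regulator = rng.normal(size=samples)
# 50 co-regulated genes driven by the same regulator, plus measurement noise:
data = np.outer(np.ones(50), regulator) + 0.3 * rng.normal(size=(50, samples))

def mean_abs_corr(matrix):
    c = np.corrcoef(matrix)
    return np.abs(c[np.triu_indices_from(c, k=1)]).mean()

full_shuffle = rng.permutation(data.ravel()).reshape(data.shape)   # destroys structure
label_perm   = data[:, rng.permutation(samples)]                   # keeps structure

print("real data        :", round(mean_abs_corr(data), 2))
print("full shuffle null:", round(mean_abs_corr(full_shuffle), 2))
print("label-permuted   :", round(mean_abs_corr(label_perm), 2))
# A pattern judged against the full-shuffle null looks far more significant than
# it really is, because real genes are not independent.
```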
Moving on to noise in discrete measurements, where the easiest example is DNA sequencing. There is measurement error here too: sequencing errors occur with a certain small probability, these days a fraction of a percent per base, and the solution, of course, is to sequence a lot. If you see a difference in your sequence, and the sequencing was not done on a single individual, you cannot be sure whether you are looking at a single nucleotide polymorphism, a SNP, or a sequencing error; but if you work hard enough and sequence enough, you get a good sense of the true DNA sequence. You then end up with a very long string of letters, three billion in the case of humans, and what is expected of you is to find genes, introns, exons, and transcription factor binding sites in this sea of four letters. How do you do that? This, too, is an issue of noise. If the genome contained only exons, introns, and transcription factor binding sites, they would be easy to find; the problem is that there is a great deal of junk DNA, intergenic regions whose function we do not know, in which seemingly meaningful information shows up just by chance.

So how can genes be found? This is why genomes are not annotated from DNA sequence alone: from sequence it is very difficult even to count the genes. If you follow the literature, the reported number of genes keeps falling with time, because many predictions turn out to be erroneous; gene prediction algorithms tend to err on the liberal side and report more genes than are actually present. So you also look at cDNA, for example cDNA libraries from the same organism, because those represent truly expressed genes; you bring the two data sources together, and a cDNA can help you locate the actual gene. The catch is that the cDNA has to be expressed: if you did not happen to prepare a cDNA library from a cell line in which that gene is expressed, the gene will not be in your library, and you cannot confirm it in the genome. So the DNA sequence can be refined to a large extent by other data sources.

But there are many unexpected issues in biology, truly amazing ones, that you would never have come up with from primary sequence information alone. Let me give you two rather shocking, fairly recent pieces of data. One is the widespread occurrence of antisense transcription in the human genome: researchers found roughly 1,600 actively transcribed antisense transcription units. You know that the genome is usually read and described in the sense direction; they asked whether things are transcribed in the antisense direction as well. You have heard a lot about microRNAs, siRNAs, and regulatory RNAs, so there was good reason to look, but nobody expected such a high number of genuine antisense transcription units. Second, when a group checked what portion of a given chromosome is actually transcribed, they were surprised to find about one order of magnitude more than expected. The usual approach is to take a chromosome, in this case chromosomes 21 and 22, where most exons and introns are known, and to expect that mostly the exons will be transcribed, plus perhaps a few regulatory RNAs, so that only a few percent of the chromosome is transcribed. When they tiled the entire chromosome with an Affymetrix chip, about ten times as much of the DNA turned out to be transcribed than expected from the exons. Again, you could never have predicted this from primary sequence information alone.
But what can you do? You have this sea of information that looks to you like noise; is there a way to deal with it? Suppose you need to find a transcription factor binding site, something like TGGACT. Of course you do not know in advance that the site is TGGACT, and it is not always exactly TGGACT; it can differ by a letter or two, because transcription factor binding sites like to play with the sequence, and that is how they tune their affinity and specificity. So this is what you are up against: you do not know what the binding site looks like, and you try to add constraints. One trick is to note that transcription factor binding sites usually lie within 500 base pairs of the start codon of a gene, and that they tend to cluster, with more than one copy in the same region. So you might say: I am looking for, say, six base pair sequences that tend to cluster within 500 base pairs of ATGs. You will find something, but it will still be very weak: there is far more sequence, far more noise, than the level at which you could expect to pick out the important information. And even after all this, you will find that many sequences that look obviously like transcription factor binding sites do not function as such; we do not quite understand why, perhaps due to a higher level of DNA organization or whatnot. And of course the fundamental problem remains that you do not know which sequence to start with.

So what can you do? You can hope that statistical over-representation will help, and one trick is provided by nature: cross-species conservation. You have these extremely noisy genomes, you call them noisy although of course they are not, human, chimp, mouse, rat, yeast, or whatnot, and you make the assumption, a fair one, that important sequences are conserved across species. So you look for sequences conserved across several species, and if you combine this with smart tools like machine learning and hidden Markov models (HMMs), which have been extremely useful for gene identification, you may start to see patterns emerge.
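As a toy illustration of the over-representation idea (my own construction, not the actual pipeline of any of the studies mentioned), you can ask whether a candidate 6-mer occurs more often in a set of "upstream" regions than in shuffled versions of the same sequences. The motif, the planting scheme, and all counts here are hypothetical.

```python
# Toy over-representation test: is a candidate 6-mer enriched in "upstream"
# regions relative to a shuffled-sequence null model?
import random

random.seed(3)
BASES = "ACGT"
motif = "TGGACT"                                   # hypothetical candidate motif

def random_seq(n):
    return "".join(random.choice(BASES) for _ in range(n))

upstream = [random_seq(500) for _ in range(200)]   # made-up upstream regions
for i in range(0, 200, 4):                         # plant the motif in every 4th region
    s = upstream[i]
    upstream[i] = s[:100] + motif + s[106:]

def count_hits(seqs, m):
    return sum(seq.count(m) for seq in seqs)

observed = count_hits(upstream, motif)
null_counts = []
for _ in range(200):                               # shuffled-sequence null model
    shuffled = ["".join(random.sample(s, len(s))) for s in upstream]
    null_counts.append(count_hits(shuffled, motif))

exceed = sum(c >= observed for c in null_counts)
print(f"observed {observed} hits; {exceed}/200 shuffles matched or beat it")
# The real studies add far stronger constraints: conservation of the motif at the
# same position across species, and enrichment upstream of genes sharing function.
```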
This was done for yeast by Lander's group, so let me give you a concrete example showing that it is a very efficient way to go. They sequenced four very closely related yeast species, each with about 5,500 genes. They chose them because they knew the species were very similar, and indeed they found a very high level of synteny: the same gene is present at the same location in all of these species. The order changes occasionally, genes are lost or gained, especially toward the ends of chromosomes around the telomeres and subtelomeric regions, where there is a lot of turbulence, but over most of each chromosome the information is retained to a large extent. Of course there is both slow and rapid evolution: certain very important genes show essentially complete nucleotide conservation across all species, while others are conserved only weakly, probably something nature can afford to experiment with. The bottom line of what they found is that important transcription factor binding sites are present at the very same location across all species. What you see here is a Gal4 binding site, shown for the four species at the same position in each; this is the TATA box, this is the Gal4 binding site, and here is another TATA box. So important regulatory information is highly conserved.

Now you can turn this around and look for unknown information. What they did is generate essentially all short motifs of the form XYZ, gap, ABC, where X, Y, Z and A, B, C each stand for any of A, T, C, G, and the gap is any number of bases between 0 and 21. This is within the realm of feasible computation; it is actually not that many combinations. You then look for statistically significant patterns. One of them is intergenic conservation: as you go through all the motifs, are there some that tend to be conserved in intergenic regions? You can also check intergenic versus genic conservation, and upstream versus downstream conservation. These are all statistical benchmarks, and known transcription factor binding sites show exactly this signature: they are more conserved in intergenic regions, with higher intergenic versus genic and higher upstream versus downstream conservation. Recall the problem: you are starting from arbitrary short sequences and trying to figure out whether any of them have biological significance. Even more importantly, the motifs that were significantly conserved were also enriched in front of genes that tend to share function, which matters because you assume there are functional modules: genes that do the same thing have to be turned on or off together, under the same conditions. Pulling all of this together, they produced a long list of potential transcription factor binding sites: sequences conserved in front of genes sharing function, many of which were later confirmed by independent experiments as true transcription factor binding sites.

So the bottom line is that even in discrete measurements, meaning sequencing, you have to face a lot of apparent noise. Biological organisms were built a long time ago and the blueprints were lost; if we knew how they were built, we could tell what is important and what is not, but all that experimentation happened long ago, so the important information now sits hidden in a sea of seemingly irrelevant information, and it is usually impossible to find it on purely computational grounds. But if you look for help from the biology itself, in this case cross-species conservation, then the important things, the gold nuggets, start to emerge. Okay, that's it. Any questions?