Abstract Interpretation With Professor Patrick Cousot Lecture Series on AI 11 J P Morgan

hi everyone and welcome to the first joint distinguished lecture between the flare flare and air a research and uh flair and ai research are two groups at jp morgan flair headed by marco pistoia and i research headed by me manuela voloso and basically we both these groups both pursue research in uh technologies that are like at the front head of the development and research namely in quantum computing and all sorts of other techniques related to technology and ai as a service and as a discipline of research and engineering that would apply to financial services both marco and i are extremely pleased to have this joint distinguished lecture because it's in the intersection of the interests and many others will come in the intersection of the interests of our two groups so without further ado i just wanted to say hi and to just welcome everyone but i will pass to marco pisoya to really introduce our remarkable speaker today thank you marco thank you patrick thanks manuel yeah so we're very excited today to have a like such a distinguished speaker uh but professor patrick uzo um professor cuzo is a silver professor of computer science at the current institute of mathematical sciences at new york university uh before he was a professor at the ecole normale superior in paris france um the corpore technique and the university of lorraine and the research scientists at the french national center for scientific research at the university uh joseph fourier of grenoble france he received a phd in computer science and uh the uh phd degree mathematics from the grenoble alps university in france uh professor kouzou along with the doctor radia cuzo is the inventor of abstract interpretation which is the foundation of static program analysis they created this framework in 1977 it is one of the most cited papers in the history of computer science and really the foundation of every work that uh uses program analysis which is something that companies use every day for good quality enforcement security bug detection and so on um so we're so excited to have a professor kuzo here he was awarded um numerous awards like the silver medal of the cnrs in 1999 an honorary degree from uh the faculty of mathematics um an informatic of university of starland in germany uh like uh the ieee harland mills joint award and the um along with uh dr radia kuzo he received the acm sick plan programming languages achievement award uh i'm proud to have collaborated with professor kouzou and dr radia kouzou uh in the past and i'm so happy to be here today co-hosting uh with manuela uh this uh this talk uh by professor patrick uzo on abstract interpretation thank you professor kuzor thank you so i will share my slides and [Music] it always takes some time to see where are my slides here ah here that's it and uh i will try to explain the principles and gives you some applications and first i start by vocabulary because it's not always well understood what is foreign semantic verification static dynamic analysis abstract interpretation and so on so when you have program you have a syntax and you can represent the program in the language by some syntactic element like a syntax tree or things like that and the semantic is a formal definition of the execution of the program so if you give your program to a semantics it is a description of what will happen at execution often it's very informal but sometimes for for example ml there is a formal definition and there are many ways of defining this for example execution tracers or retrievable state and so on then to make verification you need a specification that is what you would like your poor hand to do and verification is a mathematical proof that the program semantics satisfy the specification so when you say by program satisfied specification that's wrong that is the semantics because the syntax does not satisfy the same specification and because you have recursion and loop you must do proof by induction and that's extremely difficult to automatize because you have to in induction you have to find an inductive argument and this is a real art of the problem how do you find inductive argument when you verify perhaps then another point is called undecidability it was proved by girdell and turing and many others in the 40s that any finite mechanical proof must fail on infinitely many programs by mechanical proof i mean something which answer is always correct and which terminate so because of this you see that it is a difficult and so people have work around first for example coverity the proof is incorrect so you can make proof easy if they are not required to be correct another case is decidable cases that is if you take a specific type of program like linear arithmetic loops with no test inside and so on there is an algorithm that gives you the invariant and unfortunately this covers a very small part of programming so what you can do also is to assume that everything is finite as in model checking and so you go out of memory and the time is almost infinite so you have no answer in deductive method people have to use interaction to provide the invariant the inductive argument because they are not able to infer it and so when we do static analysis by abstract interpretation the proof is always correct it always terminate so there must be something wrong here and what does not always work is that it is sometime inconclusive you cannot say if if it is two or four the specification is satisfied or not and so the the game is to have analyzer that are precise and scale to be conclusive as often as possible so dynamic analysis is completely different you monitor the execution at runtime so you instrument your program to check things at one time and the problem is that you can only observe what the execution at the time and sometimes like in dependency you have to observe two executions and so you cannot do that at one time you have guys that do symbolic execution that they put symbolic name instead of values and happily they you see that they cannot explore all pathetic in the program for example if a perhaps does not terminate they can conclude nothing by symbolic execution others do bug finding i mentioned coverity there is another approach which is also called modeshaking nowadays you pick us path and you try to prove that there is a bug so you establish a formula more or less by symbolic execution for that path and you prove there exists a value for which the bug will happen and you use smt solvers for that so the problem is the smt server answers nothing you have no idea what is going on and so for me these are not verification methods because they prove they don't prove that the specification is satisfied so in static analysis the only input is the program text and it is valid the the answer will be valid for all execution it will always be correct an abstract interpretation is a theory which allows you to abstract the semantics that you have defined into something which is able to make analysis and also verification method follow perform the semantic by abstract interpretation so the main objective is soundness that is what is proof is true sometimes we have completeness that is if something is true the method that you use allows you to prove it but maybe you have to find an inductive argument and so the machine will not do it but for manual verification method at least you have this property that the method that you propose allow you to prove something if it is true static analyzer are incomplete that is what is true might not be provable because of approximation but at least what is proof is always correct which is important and another stuff is that abstract interpretation provide a method to design the static analysis and this design is machine checkable and so for the moment we can check only small analyzers but at least we can prove that four small tools they are proved correct for bit one it's it's a limitation of deductive method but we have progress on this so i will try to introduce informally the main ideas so you have a new universe which is a concrete so in this universe you have elements for example each point is a program execution and then you have properties and properties are sets of prime execution for example by program terminates means all these executions here are finite inside and maybe this one is all by execution terminate in less than thousands steps and so this is not manageable by computer because it's an infinite mathematical object you cannot represent it in the machine and you cannot enumerate all the points so what you do is you introduce an abstract universe of properties which will have finite representations of properties so these properties i have this finite representation and these two are represented in the same way so you see that there is a loss of information because in the abstract these two properties are represented the same way we're having the concrete they are different and that is where i do an approximation i approximate this dark blue by the light blue now i have to establish a correspondence between the two worlds here and this is done by uh what is called mathematically uh a galway connection but it's just a map from concrete properties to the abstract ones so you see the two blues are mapped to the same and there is a kind of inverse function which say this represent this purple property this one represent this light blue property and so there is a notion of inclusion or over approximation if i have the light blue i can represent it exactly in the abstract if i have the dark blue i cannot because it has nothing corresponding but i can over approximate and to get this representation and when i come back by gamma you see it's an inclusion and over approximation so i have lost information but without losing information the only way is to enumerate everything and because i have infinitely many infinite objects i cannot do that okay so do this order correspond an abstract implication and the trick is to the theory ensure that if you reason here whatever whenever you say anything and prove it in this domain it will be true here so that's soundness and obviously in this case it is incomplete because you see i cannot prove this property it is not represented exactly but sometimes for proof method we have also completeness okay so the main idea is everything provable in the abstract will be true in the concrete so now i will illustrate the idea and i have taken a program with two variables program and this represent at time zero i have x and y that are at that point then my program do one computation step then x now is change to this value and y to this one and the next computation step gives me this value and each time i do a computation step i'll move to the next states so for real program you have a number of dimension which is fabulous and this is very complicated but i take a simple example and you see i am looping forever here i come back to a previous state and because this picture is hard to manage i will represent it in this way so you see i have the time and at each time there is a point here so this one is this one this one is the next one this one is the next one and i have made it continuous because a line is better than dots so this line is one point here then when we define the semantics you will have many points because you will have many input states or inputs or different files you are managing and so each data input gives you a possible trajectories and this set of trajectories is a semantics which is a program property so it's it's one thing in blue here and the semantics is an infinite set of infinite objects but mathematically i can reason about it then i have a specification which means in my example here no trajectory should go in the head in the head points so now i have [Music] an abstraction what i will do i will say i have to prove that these trajectories never go in their head so what i will do is i will over approximate the trajectory and put them in some envelope and here is my envelope how i have computed this envelope we will come back on it on this point problem next but for the moment imagine i am able to have an object you see this object can be represented in a machine because essentially these points here and with holes and they are the one of the points i have in the abstract this is one for hand property the property is all execution are in the grid so i can compute that and then i say look because all the trajectories are in the green and the green does not intersect with the red i have proved that [Music] my my specification is satisfied so it's two step first i take the semantics i construct the static analyzer the static analyzer takes a program and construct the green over approximation of behaviors and then in the second phase you check that it does not intersect with the red so now this is a miraculous case but we know by undecidability we not always well so when it works we are happy because we know all execution are in the green the green does not intersect the proof is finished and so it is what i represented here what is true here is true here the fact that the green does not intersect the red in the abstract imply that the trajectories here do not intersect the head now let's see other method when you do testing you try a few executions and if you are lucky you will find this one that god in the head but everyone knows that programmers are not lucky and often they miss errors and so you say that this is not sound because you miss errors if you do model checking nowadays people do bounded model shaking they say i will try all execution for 10 minutes or seven days and then i will stop and so you see at the beginning there is no error so they will claim i have found nobody but you see that later there is one and they miss it so now what can happen also is guys like coverity they say i cover but they forget some cases so in fact they have proved nothing because they have proved that maybe there is no error in the green but because some execution goes out some might go in the forbidden bone and you see an error and you will get no message so this will never happen with abstract interpretation based static analysis but what will hap will happen in this it is incompleteness that is you see i have all the hypoxia method but a bit too much and so now the gray the green gold in the head so what i must say is you might have an error here and i don't know if you have one or not because there are two situations the first one is you have a mirror because there is one execution that goes in the red but i don't know that i know only this so i don't know that there is one and the other is this one there is no execution but again i don't know it so this is called a false alarm and this is called the true alarm so in that case the only solution that we have is to say our analyzer is not precise enough we are going to do abstractions that are more precise and so the design of a static analyzer is a bit experimental you start with very rough abstraction core substration and you refine your abstraction until you are satisfied knowing that it is not possible to get perfect solution so i am going to go now to more specific detail there is a little part which is theoretical just to show that there is some mathematics behind but if you skip it's okay so what is a trace semantics it is a sequence of finite finite or infinite sequence of tracers that i have represented by the lines and you may have infinitely many of them or infinitely many of them for example if you have different input so the trace just records the state of the memory along time and the state is just what is in the memory plus what is in the control so the control consists of it depends on the language but usually you have a perhaps point you have a stack to record the calls to procedures you may have a more complicated thing with object languages and memory state record the value of a variable what you have allocated in memory what input you have done and so on so i will take a very simple case it's a y language it's a small subset of c and the state will just be a program point l saying i am at this point in the program and the environment rho here tell me what are the values of the variable have in this state so rho of x is 5 mean the value of the variable x in the program is equal to 5. and this semantics is defined by structural induction that is by induction on the syntax of the program so i define it for infinitely many programs by considering infinitely many cases so a base case would be the assignment so i have an assignment s which is a label a variable a sign to which i assign an expression and this expression has no side effect very simple arithmetic expression with variable constant basic operation plus minus and so on so what are the traces is prefix it means i just observe the beginning of all execution but it can be as long as i want so here i can have the initial state i am at l with any possible values of the program variable and you see i stop there my observation is just i start the observation i stop immediately then the next one is i start and then i do one step so after this step the control has changed it is after the statement and the value of the variable has changed because x has been assigned the value of the expression and i have some formal definition if i provide the value of the variable this function will give me the value of the arithmetic expression and this rho is the old one but now the value of x is v so that the base case very simple and now we move to a very complicated case which is iteration and for that we will define the semantics by solution of equation of this form x equal f of x and this solution are called fixed points so when you have such an equation you may have zero one or many solution and there is a famous theorem by tarsky that says if x belong to a complete lattice x and f is increasing then there is a unique least solution the least fixed point for this order on the so you have a partial order which in our case will be implication and you can calculate in some cases by iterating so you start from the smallest element you apply f you apply f this may go on forever but you can pass to the limit by taking the least upper bound and this is a fixed point so there is mathematics that ensure that when i write this this object exists and now is the definition of iteration so i will spend some time on it you see the semantic for the while is a leaflet point of this operator so that's my function f in the previous picture and now i define this function f and it takes an argument which is the prefix tracer that i have built up to now and because we start from bottom at the first iteration it will be empty we say i have built no iteration and so you see this is false because it's empty this is false because it's empty so these two terms here are false and there is only one which is possible when x is empty which is an execution can start at the bottom entry of the while loop and stop there so that's the same as the assignment it is the basic case let's now we have done x is this and we can iterate and the next iteration let us consider this case the next iteration x will be this tracers here i evaluate the boolean expression in the environment and it is false so what is the possible execution it is you arrive at l the test is false so you exit the loop and you never enter so this is this execution i arrive at l and then because the test is false i go after the while without changing the value of the variable because in my language i can have no side effects in boolean expression and now there is another case remember i arrive adele okay i arrive at l here and when i evaluate in the environment it is true so what i do i start executing the body and so to execute the body the semantics of the body i have already defined it because it is by structural induction so i defined the semantics for this and then i defined the cementite for that so i can say i know the semantics of the body and so the semantics of the body will tell me that i have a possible execution here which is your hive at the beginning with the same environment which was the initial one and you go inside the loop body so when you go inside in the loop body you can stop at any time and so here the semantic of the loop will be i arrive i go in the loop body i stop or you can say i terminate the loop body and i come back at l and so here you will have i arrive at l i execution i execute the body and i come back at l so now our x as either i have done no iteration that was the big mavic case or i have done some iteration and stopped in the middle or i came back and then again here you will have the test is four so after one interaction i exit or i do one more and so on so after two iterations i exit or i do one more after three iterations exist so i gotta go one more and so when i iterate this function i have longer and longer iteration iteration in the loop body it can be infinite so it will maybe it will go forever and when i take the least upper bound which is a join of all of them i will have the prefix execution so you see mathematically we can give a formal definition and this will be the basis of all our reasoning if this is false this is not really good then all our later reasoning will be wrong but you see nowadays we can take this and generate compiler that are proved correct from search definition for example concert is a project where a compiler of c is proved correct for some machines and you choose the definition of the semantics like this so if we use seven semantics our static analyzers will be correct okay so now we want the maximal traces so it is a prefix resource so that you arrive after the statement or there was a break inside the segments and so you go to the level where you break and there are also infinite tracers where are infinite sequences of states so that all the prefixes were in the prefix semantics so you see that to this definition of the while loop will be i iterate and at some point i exit the loop and i cannot go longer this will be a maximal execution all the loop goes forever and this will be an infinite sequence and this will be another execution of the loop because if there are inputs inside of the loop there can be many many execution so that the end more or less of the formal part i did it not to show that i am seventh but to show that it is possible to give formal definition of languages and for example for c there are some okay so the first abstraction is reachability so you see my abstraction it takes the tracer that i had before and also it takes a precondition that i have a bit forgotten here it takes a program point and it returns a set of values of variable at each point point so you see it says how do i get this environment at foreign point l i pick an execution any one i have defined previously what i forgot here is to say that it should start with an initial state in p0 than the precondition so i forgot it sorry so i picked all the tracers you have defined out before this one here and i i look at all the traces and in e stressors i look at all the time you go through this point l and i collect the invariant the values of the variable so what i get is when i am at point l x equal one maybe or another one would be x equal five or another one would be x equal ten etc and you see that when i have defined at this i take my previous definition here i apply the alpha and i get this one and this one generates the strongest invariant that is exactly at each point point what are the possible values of the variable and of course this is not computable so it's an intermediate step in my reasoning but from this i can derive proof method like floyd or logic if you know i can derive this from that but it's not computable so i will do another abstraction and we will do a cartesian abstraction so assume i have an invariant say saying x equal 2 and y equal 1 or x equals 3 and y equal 2 or x equals 5 and x equal 4 and y equal 4. so this is this red line and x and y live on this red line what i will do i will project on the coordinates here so when i have projected i know that x is between 2 and 5 and i know that x y is between 1 and 4 but what i don't what i have lost is the fact that y plus 1 is equal to x and i have loved the information that i am on this line and now the information is that i am in this box so the abstraction is very simple for each variable so remember this one a set of environment the possible values of the variable at some point and now for each variable i project and collect its possible values in r which i have written p here it's unfortunate i should have written r yeah so you see the idea i forget about relation between variables and so now we have sets of values but i can have infinite set that is x might be 1 5 10 and so on forever so what i will do is i will make a further abstraction i will take the minimum and maximum value of the set so now i have an infinite set of values here and now i have two numbers or maybe plus infinity and minus infinity and now this is representable in the machine i can represent that for each variable and each program point that finite and i can put that in my computer and compute the semantics here where i have transformed everything into intervals and this will not terminate we will see later why but it's a progress so when you design your static analyzer you will have many many many abstraction as tray as maybe 60 or 100 or it depends on the configuration you take and because it's extremely difficult to find the abstraction that will cover all your problems so what you do you say i take intervals it solves some problem i take congruences it solves some problem for example to know that x is equal to something modulo 4 is useful in c because you must have variables that are aligned on where double words and so on so if you know that the value of the pointer is some constant modulo 4 you know the alignment so if you have two analyzers with congruent third and interval what you can do is make a reduction that is you see if you are on one and this point and in this interval you must be in this interval and these points so this gives me a reduction here saying that i must be between these two values and this is more precise than that because one is impossible and more precise so this operation of reduction is really important because you create more and more complex abstraction by adding new one and making reduction so that you can transfer information from one to the other now we have fixed point and the idea of fixed point abstraction is the following you have a concrete function which has a fixed point here you over-approximate it and you compute the function of the over-approximate and you see that it is above the concrete fixed points so when we cannot compute this we will take another approximation of the function compute fixed point of the function and we guarantee that this is another approximation here so another representation is this one i have my function i pass to the limit i continue to iterate and i abstract to four points only which is very nice you see at this point they all go to this stuff and all these guys they go there so now when i iterate it will it will be a finite iteration and this point over approximately the fixed point here so they are correct they are not very precise but they are correct now when we have interval you can have zero one zero two zero three zero four forever and so we must have another idea which is called widening in in abstract interpretation and it can be linked to a newton iteration for example if you have a function and you want to create calculate its fixed point you iterate you start at zero you project the value f you project the value f you project the value and very slowly converge to the fixed point so if you do this iterates in a machine you have errors because of floating point computation let us forget that but it will converge extremely slowly so what newton proposes is say instead of using the function i will use the derivative and so instead of applying the function going to next point here i will use the derivative to extrapolate and when i have this condition i know by tarsky here i am that my x is another approximation and so that the idea of getting a fixed point or another approximation of exponent which is very important because it is an inductive argument and that's a way of finding inductive argument that we can use to do inductive proof and i let i take the same example of widening for intervals so i you can imagine that the analysis start executing the poem and the first point is this one and i abstract it in interval it is the same point then i do one step and i get this previous point here plus this one here so this two point that's not an interval so i have to abstract it into an interval or excuse me which here this one x is one and y is between y two one two then i go on and some other point imagine it's a loop so the next iteration introduce this point h equal to y equal to so now i have an interval and a point this is not an interval so i abstract it and i get this interval i my three points are definitely in this rectangle and i go on and i have this new point and i have this new interval and you see that it can go on forever by introducing new points so this is where i'm going to do widening i'm going to say i take the previous uh interval i take the next one which is this one and i do a widening that will extrapolate to get this this interval you see the idea of the extrapolate is very simple one is stable i keep it two is not stable i extrapolate to plus infinity one is stable i keep it two is stable i keep it and so you see it's an induction saying oh i guess that maybe the next point will be in this interval so you see that i cannot go on forever because if uh in the worst case i will get minus infinity plus infinity everywhere this will be a very imprecise but nevertheless point and so this idea applies always always applies and when you take example of static analysis like numerical properties we have seen these semantics we have seen we can cover it with interval this needs a widening we have seen that we can use simple conjunctions we don't need widenings for this one because the effect domain doesn't have increasing chains you can have octagons here which is uh more precise than interval and can be used to handle arrays for example when you have an array from one to n you need to know that the index is between one and end and octagons will allow you to do that when we work on control command you have filters you need elites for those sorry to do under them and when you are floating point computation you need to bound the errors and so that's another extra strike domain in all these cases but this one we need widenings and so you see this this can be very precise by combining them uh professor sorry for integrating just wanted to remind you that we have approximately 12 minutes so i am very late so we can handle symbolic properties like heap allocation we can do checking that is check specification like this one you see this one means when i am not at lx will be positive for a number of times then if i arrive at lx would be 0 and then i can be anywhere and x will be negative so that's that's something i can do and soundness i will not go along with this and i will go with two examples of static analyzer i started late three minutes i started let's scream in it so i will take five minutes the first one is andromeda and you know this guy who did it at ibm and what it does is security for web application in languages for the web it is very interesting because it's demand driven so it does not calculate the information and that check it check and if it is missing information it calculated when needed so it's it's precise and scalable and you check things that are bad on the web like your exercise or sql injection and things like that and i wore the co-author so but but marco and his team did all the work i just i just thought about it another one is astray which is used in europe in automobile industry in avionics in space in the u.s is

not really used so let's go up you have a nice interface so when i did that in my team at ens it was sixty thousand line of uh camel now it's two hundred sixty four thousand lines over camel not counting the in the interface which is also thousands of thousands of lines here so it check for division by zero out of bounds pointer that have bad arithmetic overflow it can handle parallel programs so you check for that rss locking which is bad uh it it used operating system so not all operating system but in avionics and automotive you have specific operating system and it can you analyze the interaction with this system and so on so it was evaluated by the american government and which objective is to develop techniques for evaluating software tools so they tried four tools and two were successful one is astray which is french and based on abstract interpretation not by my former student and this one is also french based on herself interpretation and done by my master student it's not uh it's not phd this one with phd and the other is they don't say what they are so you see they have test set and astray found and both found all all birds plus some false alarms astray much less than a pharmacy and interestingly they astray discovered birds in the test set that is they were provided they said they did not know some birds and astray found them though the guy made a manual analysis and now these bugs are included in the one you must find if you want to go to the vessels now you have to choose an alivers so the first thing is you avoid it because programmers are never el responsible so they can do anything it will be always great then the other thing is to take academic analyzers for example this paper is a recent one that checked model shakers on automative code or you see here about thousand or two thousand lines of sea astray or narrow analyze 10 million lines in one shot so you see the scale is ten thousand here and when you look most of them fail but they are very successful on the academic tests that are built to be a little poem of two lines that are difficult to analyze but it doesn't scale references there are two excellent books and at the end thank you thank you professor kouzou such a fascinating talk thank you very much okay you i will put the slide on online you can also take them it's no problem thank you so i think we have time but for a few questions um hi any of the attendees that would like to ask a question please use the raise your hand feature found in zoom looks like we do have one hand up currently it comes from carrie carrie please have me to ask your question hi peter how are you thank you so much for presenting today can you hear me okay yes yes speak slowly because my english is not perfect okay um i am wondering i appreciate your expertise and i'm wondering if um we were talking about symbols at some point i'm wondering if you could provide some insight on what is the symbol for iterative would it be uppercase sigma or does it really matter i i spoke essentially at two slides in symbol one was a picture of a data structure i don't know if you refer to this one and another one was you give a specification using a regular expression where you replace the symbol by assertions and then you can build another error that will check your specification if you speak of the second if you speak of the first the symbol are not just a representation of a very complex graph which is the sets of all possible memory allocation in your poem so one memory a location is a graph and because you change over time you have to represent this set itself by your god so it's it's rather complex you can read the paper i gave the paper and the other one was more oriented towards the checking for hand but instead of doing testing you just specify properties and you use an analyzer to tell you where whether these properties are true or false and instead of invariant i want to be able to represent evolution over time so either you choose a temporal logic where the formulas are just saying when i am at that point this is true of the variable and then i go to next point this is true the variable and so on but my observation is temporal logic is a bit difficult so i use a regular expression because most programmers understand regular expression so but you see a regular expression tell you a possible sequence of properties of the execution and you can check that very precisely if you want but you see the analysis will not enumerate contrary to body checking that's a big difference i don't know if i answered your questions i appreciate your insight yeah okay thank you i'm showing more hands the next hand up is from anne and go ahead and i'm mute to ask you a question hi professor thank you for the lecture um you were talking about dividing the input sets and um i am interested actually narrowing the input sets so that we can identify the sets that will cause a program to fail um yeah yeah have you thought about it where can i read more about that that's something i'm interested in that's why okay so uh i have not spoken of what is called backward analysis where uh with the analyzer that i have point they start from the beginning and they go in the future but you can do the reverse you can say assume i am i have reached this point here where there is an error and let us go backward to see what are the input condition that will create this error and so you can iterate this process and it will stabilize at some point then there are two cases either you prove that it's possible to have the error or it's impossible or you don't know so there are now now a number of things that you can do for example you can say i will cut my input domain in two cases you will do a dichotomy you you have an input interval where you might have errors you could in two and you do two analyzers and uh we have done experiment like this and it is more precise than many many other tools the problem is to scale so if your program are reasonably small that is one million lines okay if it is a billions lines it will not work but there are there are techniques and many people work on this technique nowadays it's doing progress slowly there are randomization techniques there are case analysis techniques it is a problem because when we were doing with the airbus for astray when we were finding a bug they were asking for a counter example and a plane fly for 10 hours if you have a bug after 10 hours your example counter example will be a trace of 10 hours with a point every 5 milliseconds so it's a huge object and so usually what we do is we don't go backward up to the beginning but we go backward where to a point where we are confident that things are okay and so if you are in a loop we will go backward up to the head of the loop not to the beginning of the loop and this proves to be rather effective you can also ask the user to help a lot by providing for example a regular expression telling where he thinks the error should be found this will narrow the the bandwidth of the caser that we have to check and so my answer is if you want one execution it's rather difficult but you in general you should have a little bandwidth the programmer is intelligent enough to understand at some point what is the problem that's the problem that's my experience thank you professor yes i have a way to narrow down the line number set to the one that interests us and i actually appreciate your point about not going to the beginning but going to the as we call less known good and going from there yeah thank you also if you have procedures and so on most there are techniques to infer specifications uh for the precondition post condition and then you can do your analysis within the body of the procedure which is usually not that big so up to procedures of 1000 line 2009 it will work the problem might be that the specification of the procedure is wrong but this will be another disease that will find it i'd love to talk to you more about it but i know you're out of time but thank you so much uh i have not understood if it is a question or not i i apologize i said that we are out of time i would love to talk to you more about it if you've done something interesting in my space i'm sure you'd be curious to find out um any way we can contact you later yes yes it's no problem that is you ask marco he knows my email you send me an email and we can discuss we can do a zoom session i can give you my room number i am available to discuss if you have more technical this was a very general talk so i did not go into all details but i can go if you want to discuss this separately and available i'm a researcher so most people think i have nothing to do and i am always so patrick i'm sorry i have to leave yes thank you very much for your talk and marco will say the final goodbyes okay that's okay bye-bye thank you thank you patrick professor if you still have time we have a few more questions but if you yes yes i can i can stay as long as you want oh thank you i was impressed by the number of participants which is a [Music] impressive i don't know if it is true but i have never had a class the maximum class i have had was 500 which is much less than we had today yeah we had over 806 people attending so it was really uh incredible but yeah we have other questions coming yes yes our next question comes from alexander alexander please i'm you to ask your question yes please uh professor kuzad thank you very much for absolutely great presentation i have question like few years ago i was working with third party company which used first audio logic proof solvers to analyze the code and which is kind of similar concept to what you have to using kind of like mass and analysis to analyze code what we found though that it works pretty well on low level languages maybe like assembler or even like c but for example in highly distributed uh type of programming like micro services or for example in my case that was scala based a framework using actor system it was really hard to use for system like that and find errors and find this kind of like conditions because of orchestration because it's like highly distributed the system is highly distributed uh messaging driven uh do you have like any thoughts about that yes so one problem is that we are dependent on the language because we depend on the semantics and when you take two different languages they have two different semantics and so it's very hard to do an analyzer that works for many kind of semantics in particular in parallelism if you take monitors actors semaphores or communications asynchronous synchronous and so on this time it's a different model so it's a different analysis third point second point it is difficult so researchers they do small problems and they don't go to big ones so they take lower hanging food first now it is not impossible we astray can analyze parallel code so it is statically allocated processors so the example i have in mind is a pilot interface in in the planes so it's the communication with the pilot and the systems you see the screens and so on so it's a three million lines of c it has 15 processors and it has many communication for example it has voice and sound and blinking stuff and screens that blink and things like that and you want to prove that when you have somebody speaking a message there will be no sound at the same time and uh when you are blinking screen it will be on two different devices not on the same and things like that and so on this program which is partly on return and partly generated automatically we had about the time thousand alarms so i was completely disaspirate i say southern alarm it's not manageable but when i discuss with the people they told me yeah but we have hundred engineers working on this project and a thousand divided by android is 10 and each engineers will have between 0 and 20 alarms and this will be manageable and so that we went on and and the problem was there was a big data structure where which represent the state of the system with semaphores inside and so on and so we are still working on that and i am all fooled that someday we will reach almost no for sale the trick is that this is for a mobile of parallelism that is uh most often found in embedded systems and it it will work for automotive it will work for planes satellites and so on and it will not work for say cloud computing because the model is different so what i think is unfortunate is that you have thousands of researchers they they work on small problems they publish a lot and they are reluctant to have projects that take australia in 20 years so you see it's very difficult to have funding for this long term and so that's that's a difficulty that is if you don't invest bankers they know that if you don't invest you have no return so that's the same in program analysis so you say the actor there are some analyzers but you see usually it's one guy that does that for his thesis and then it goes to the trash to have a long-term project is much much more difficult thank you very much for your answer actually i got very good answer and my experience working with this company like you say they want to go with something they know yes yes explore some like boundaries which uh kind of like new for them and we couldn't find like any grounds and like you say for cloud computing and distributing system actually your approaches actually might work uh oh yes yes yes you see i i i can tell you a story when i was young so uh after maybe five years of abstract interpretation i decided to work on parallelism and i started to work on parallelism and that at some point i got no money because the people at the time were saying yeah they built this parallel machine but by the morgan law you see two years later your mono processor is more powerful than your parallel processor so we are not going to spend money on parallelism and so we are paying now the fact that the deciders that 10 20 years ago were not able to anticipate that parallelism would be the main problem of nowadays and so there has been very few research that scale any sound that's that's unfortunate thank you that's great question and now with the time when we cannot scale up no more right because of quantum limit of the microprocessors and you have to go to parallelism there is no other way thank you thank you so much uh for your answer professor all right and we have another question from caroline caroline if you'd like to ask your question i understand that i hear nothing there is a mic not working is better not better [Music] my question is as a student of mathematics i was very interested in this problem can you recommend lectures that i should get into you already mentioned that uh i think the [Music] carolina can you actually mute yourself for a second uh you might be in a busy place but i think caroline was asking you professor kudo if you have uh any literature a book maybe that you would recommend for yeah at the end of my talk i gave one one book which is more for engineers and which is written by one of my former students so it's a good book so i would recommend this one and the other one is a book i have been working these last years it's not yet published it will appear in september but it's more for researchers so uh i can show my slide i will put the slide online we can actually get your slides later and find the reference yes yes so thank you caroline i'm sorry but i think maybe you are in the office and there were other voices coming through but i think i i got uh your question for professor kouzou right i am showing a couple more questions our next question comes from tom tom can you and mute to ask your question all right yeah um can you hear me yes okay excellent um i guess it's a question that follows on a little bit from the previous one which is we're often in the business of having to build highly distributed systems with dependencies of outside of our control as well and and certainly we've looked at using something like uh tla plus to help us understand and specify our our interfaces and and determine that our protocols are right but you seem to seem to shy away a little bit from temporal logic as a as a way of expressing these things how how do you see the work that you've done applying to these large-scale distributed systems that were kind of increasingly being driven to to both write and also reason about um so because we would like to be able to to validate correctness and then and then be able to assert that the implementation we have and the assumptions we've made about the underlying infrastructure is actually uh asserted by the by the model okay so this is a not a simple question so so when you speak of the tla plus or the b method or even b or method like this it is a method that proceeds by refinements that is you are supposed to have a specification and refine it if i need until you get a program at each refinement you are you have a proof obligations and it is guaranteed that at each level whenever you have a requirement they will be satisfied at the implementation level if you have proved the refinement concerns this is very nice for small algorithm that is if you are a small protocol of say 300 lines it will take you two months or three months to get it in this way but in the end it will work so if if you go to very specific algorithm i think doing program refinement is great and formally it is just an inverse of abstract interpretation it can be formalized in the same way now if you go to program data hundred thousand line or more that do that are duplicated on thousands of computers i don't think you will make it because it's extremely difficult to go through all the proof the proof are really usually much larger than the perhaps so if you have a program of ten thousand line your proof will be one million lines so you see you move the problem of understanding poem to understanding proofs so i would recommend refinement in the small in the large in principle it should be possible to have a static analyze analysis tool that offers some guarantees like your message are not lost or there is no global deadlock or things like that unfortunately no one has invested in shots too so you can for example if you use astray without parallelism you will be able to analyze perhaps in c or c plus plus whenever there is a call to communication or system you will have to provide some specification and that's it it will not be a global view of the system so yes your question is my answer is indo small do refinement in the large do analysis but there are no analyzers so try to find company that will build them it's extremely hard because there are no competent companies in software in uh in the u.s you are maybe one or two or three that are competent and most of them are commercial they for example one has an analyzer that works on 70 languages so when you take the intersection of 70 languages you prove nothing that is you take any 70 languages in the on the market and you will see that they are sufficiently different for that you cannot do the same reasoning for all these languages so i am a bit pessimistic and too old to start a project of 20 years now but my student in europe they are they are they are starting such projects don't worry in 10 years you will have solution that the scale that is uh it's so difficult that you cannot uh you see i my last student it takes took him seven years to do a thesis and he was employed next day by facebook so that was one on the market which is employed immediately and so to find pen because a project like this you need 10 you need a guy who has experience and failures previously to to understand what is important and not and to draw the contour of the project and to organize this it's really difficult and the competence is is not there unfortunately you see i am the only one giving a class from abstract interpretation in the u.s other classes you have one or two classes so people know it exists but they never go into the details and so they are unable to apply that's really a problem so what i would recommend is you take the problem which is the most important for you only one and try to find somebody who is able to find this kind of bugs and only this kind of bugs that makes a pro a model of programs that find a specific bug that is important to you and that's already a start and then you move on to more and more complicated properties people get experience and slowly you grow your team and you grow your uh the scope of your tool but starting small so when we did astray we asked for small programs so we got a simulator of planes which is a big pro-up then they told we told them it's too big so we said maybe you have a simulator of the simulator they said yes we have one it was too big so we said maybe you have a simulator of the simulator of the simulator yes they have one so this one we analyze this and we analyze the second one then the similar time then the plane it took five years but it grows slowly from problem we can solve to more and more difficult problem people get experience knowledge and i think it's a way to go if you say i want to check my cloud computing is too difficult you will not succeed you have to isolate and i spent three months discussing the problems that are important to the avionic industry before starting my little 500 analysis of the prognosis so it's important to i would say restrict the ambition to start with maybe you ask too much to these companies maybe maybe okay thank you very much all right i am showing a few more questions our next question comes from vincent vincent please i'm you to ask your question that's all in french we don't pronounce the t at the end so you don't say kuzat you say kuzo like merlot pino and so on okay that's so good i'm actually from canada so uh i definitely got a lot of all songs back in the day okay uh my question is um ass is partially in good humor but how do you how do you analyze the static analyzers how do you make sure that the the program itself is sufficiently complete so when we do that straight we had a theory so you have seen that it goes down to small programs for example interval you have the addition so my student they were doing proof by end so they say the interval of the parameters i will prove that the interval of the result is this one and they have done that a lot for filters for all the domains i have have done and the general structure was known to be correct because for years we had done that so we did not do a proof but the fact that behavior theory is a guideline and with new problems for example when you have a reduction it can destroy the widening so you do a widening and then you do a reduction so you have extended your interval and the narrowing will bring it back so next widening will extend it and the reduction by some analysis will bring back and it will not terminate so we knew this problem was existing and so we prove it did not exist in fact it did but we made a modification and then prove it did not so it was i would say informal proof like mathematician do they say they have a convincing argument but not a list of formulas that are saudi arabia and then a subgroup started a project where they built a small analyzer with interval octagons and so on the same structure as the stray with the layers of a domain that communicates for reduction and so on and they prove it with concert so the problem is that this smaller analyzer or the factor is 100 with the true asterisk so you see the analyzer is small we can prove it formally we have the the mathematics we have the prover we have the guy that do it to scale is difficult is the same problem as scaling proof that you do by any prover when you go big is difficult but uh contrary to many if you compare to compilers for example astray we have found very few bugs very few we the birds we have found one bug every month you see that that's very small that is [Music] and now it's so long time since we have not found a bug the bugs that are difficult are the one of imprecision that is your answer is correct but the bug is you could have done better because you have over-approximated at some point where it was not useful these bugs were difficult to find we had a few because people say oh i will not spend too much on this program i will over approximate it will be correct let's go and so it works and someday you find a problem where you should have been more precise so these birds which are sound but impressive are difficult and i can tell you the last bug i heard about it was a guy using the analyzer he said i have found a bug in astray so we said no problem we will do a meeting so we need a little extract of the pram he came with his extract and we thought about it and you say the problem is that the the bug which is reported by astray is a true bug it's your program and we showed it is bug so he run to telephone and discard the bug with his uh friends in his company it was a flying bag so it was important so i am optimistic that there are very few bugs in program design with a theory in the ad that you can use informally as mathematicians do now there are very few people doing that are you satisfied by song yes thank you for that thank you and our last question comes from ian and please let me ask you a question okay thank you um first of all thank you very much for the interesting lecture um it it really is is one of the the areas that i'm particularly interested in now you pointed out that the analysis needs to reflect the semantics of the programming language we would like to maybe design the next 700 programming languages to have semantics that are most helpful to static analysis excuse me to abstract interpretation and yet still have the largest domain of useful programs so with that in mind uh what kinds of semantics are more or less hazardous to that goal uh or how can you give advice to uh perhaps a language designer that wants to be able to handle non-deterministic concurrency dynamic behaviors um and so forth without making it impossible to tell if your program is good yeah so which kind of semantics i would say no one because in practice you don't have one semantics you have several for example you will have tracers which is one kind of semantics with interleaving for parallelism which is not so good but maybe you lose parallelism and then you will have another layer maybe which will be where you separate for example the semantics of procedures from the semantics of the poem so this is an abstraction that can be formalized by abstract interpretation then you will have to maybe you will want a semantics of function which is precondition and post condition to be able to analyze function procedures methods and so on separately then you if you have classes you will have to have a class invariant and so on and so this will be from your trace semantics initial level of abstraction that you can define formally and now when you go to the analysis so some problems some semantics will be useful for some others another semantics will be useful so in practice we have a kind of energy of three or four semantics and we build the analysis with respect to one of them for some property another one for some other property and because we have shown that they are abstraction of one of the other by composition we are sure that the they all underlie the same concrete basic property so that's one point the next point is to have a universal language that does everything is extremely difficult so you see if you have dynamic creation of processors you have distribution you have arbitrary complex classes in heritage if you put everything in a language i think history has proved that you you will fail and i would cite that example from ibm because man now marco is no longer at ibm so ibm when i was young said we have too many languages we will create one and they called it programming language one pl one so i studied pl1 and you add for example automatic conversion from one type to another so you add a matrix of 80 by 80 and you have a rules to go through this matrix to know whenever you have any data structure like a boolean how it becomes a complex number or an array of characters so you could do that and so this slug where they spend a lot of money on developing compilers on developing type system they created the laboratory in switzerland to have a formal specification with the vienna definition language and so on and in the end it failed because it was too complicated too many bugs programmers is not understood if you don't have an ibm machine you cannot run it and so they abandoned after spending i don't know how much the good point is that they learned the lesson they never tried again to have a universal language and the second point is vdl the formal semantic became what is called vienna definition method which is still used at least in europe to to for specification and perhaps development so it was not a waste effort i would discourage you to to do a universal language and if you want a language which is dynamic i have one of my students called antoine mine that has a an interpreter for i think it's pearl which is rather dynamic and uh like if you send me an email or marco send me an email i hide that i will have to give you the reference and you can try the analyzer it's for one language it's it's very different from other one because of dynamic aspects typing and so on and babies can be useful all right i suppose you're p cousseau at nyu.edu yeah you you send me a message and i will tell you the the pointers you need and the previous question on the book i will send it to marco a with my slides and i will have answered these two specific questions thank you professor so i would like to thank you really very much for this very fascinating uh and interesting lecture uh i'm sorry that we kept you here for uh an hour and 36 minutes but i didn't like it it doesn't happen often that we have the opportunity to meet somebody who knows like the bible of uh yeah obviously yeah now that i have a bird i can prove that i am an old guy no you you definitely like the top scientists in the field so it's an honor for us to have you here and to have to

2021-08-25

Show video