[Music]

Hannah: Welcome to Google DeepMind: The Podcast, with me, your host, Professor Hannah Fry. Now, 2025 is coming, and so is the era of agentic AI — although of course you'll have known about that months ago if you've been listening to us. So now listen in carefully as we tell you about the next thing. It's called Project Astra, and it's a research prototype that is pushing the boundaries of what might be possible with a universal AI assistant — an agent that is, by design, not necessarily tethered to a particular device or a screen or a keyboard. This is right at the brink of the cutting edge, and today we get to play with it. Project Astra brings together all of the things that we've spoken about in this series: memory, vision, context, reasoning, real-time interaction. And someone who can tell us a lot about all of those, and how he and his team have managed to get them to work together, is Greg Wayne, a director of research at Google DeepMind. Greg also happens to be one of the very first people I interviewed, right at the very beginning of series one of this podcast. Greg, welcome back.

Greg: Hello, Hannah.

Hannah: Let's start at the beginning, then. What is Project Astra?

Greg: Project Astra is a team, and a project, and a prototype, aiming to build an AI assistant with eyes, ears and a voice that's co-present with you. It's with you in any place you are — either through smart glasses or your phone or your computer — it can see what you're doing, and it can talk to you about it.

Hannah: Like a little assistant that sits on your shoulder.

Greg: Yeah, like a little parrot on your shoulder that's hanging out with you and talking to you about the world.

Hannah: That's just smarter than you.

Greg: In some ways, yeah.

Hannah: I suppose, then, in a way, that is different to, say, Google Assistant?

Greg: Yeah — it's more embedded in the world. The older version of a Google Assistant was kind of a command-and-control system for your home or your information — "turn on this song on YouTube", or something like that. This is about being able to talk about the world. It builds on Gemini — Gemini is the intelligence underneath the hood, along with some other systems. I think it's complementary to the vision of Gemini and works with it, and maybe helps shape Gemini and is also shaped by it.

Hannah: Can we try it out, then? Because I think this is something you sort of have to experience to understand.

Greg: Totally, yeah. Okay, I'm going to turn it back on. [Music]

Hannah: Greg — it knows you?

Greg: It does, yeah. I've had many conversations with it before.

Astra: Indeed, we have spoken previously. It is good to talk to you again.

Greg: I'm going to hand you to my friend Hannah for one second.

Astra: All right, it's nice to meet Hannah. I'm here whenever you need me.

Hannah: Hi, Project Astra.

Astra: Hello, Hannah.

Hannah: Can you guess what we're doing today?

Astra: Based on the equipment and setup, it appears you are recording a video or podcast in a studio. Is this correct?

Hannah: It absolutely is. Okay, I'm going to test you on a few things, because I brought in a few objects. Is that okay?

Astra: Yes, I am ready when you are. I look forward to seeing the objects.

Hannah: Okay, let's start with this — this is my favorite book. Project Astra, can you see the book?

Astra: Ah yes, I can see the book. It displays a graphic with a repeating pattern, a motif known as M.C. Escher's Circle Limit IV.

Hannah: Can you guess what the book is?

Astra: Based on the image of the open page and the artwork on it, it appears to be [unclear].

Hannah: No! Okay, I'm so impressed. I'm so impressed. Okay — it's a crufty old
copy, I know, I know — you can tell it's my favorite book. It's been loved — it's been very loved — over many years. Okay, I'm going to try a couple of other things, just to see if I can really test you. All right, let's try this.

Astra: It's good to see you have brought out the nice furniture to complement the book. Do you require my assistance with anything else?

Hannah: I have brought out the nice furniture, Project Astra, yes, thank you. Okay, what about this — do you know what that is?

Astra: It appears to be a model of a brain.

Hannah: Which hemisphere is it?

Astra: It is the left hemisphere of the brain model.

Hannah: Wow. Why did you— well, I mean, I had to do a lot of mental rotation for that too.

Greg: Which way round is it? Oh yeah — it got it right.

Hannah: Amazing. I might just try — there's a whiteboard behind you that has been there all along; I've basically been looking at this for many months now. So, Astra, if I show you that there, tell me what some of the drawings are on the whiteboard.

Astra: The whiteboard contains a variety of drawings, including a tree, buildings, and a series of connected lines and shapes. There are also some musical notes and text bubbles.

Hannah: I'm very impressed. I'm very impressed. Let me just put this little guy back where it belongs, over here. So, Project Astra — it's still this research prototype; it's not available as a product that everyone can just download. Why demo it now? Why wouldn't you wait until it was ready?

Greg: I think it's nice to bring the public along on the journey, in a way. I think people should get to know what's being developed inside the labs, and we're giving it to more people to start playing with, and start adjusting to, or giving feedback about, now. It's a co-creation process, where it's not only some kind of thing that's being cooked up in a lab — it's also being cooked up in collaboration with a group of users around the world, people outside of Google. That's important too: if we're going to make this really a helpful thing for humanity, then people need to start using it and telling us how they feel about it.

Hannah: So have people been taking this out and about and trying it out in the real world?

Greg: Yeah — we've had these trusted testers, people who are using it, who signed up to be kind of early adopters.

Hannah: What are people using it for?

Greg: People are using it for things like getting fashion advice from Astra.

Hannah: Oh really? In what way — like, "what would match with this"?

Greg: Yeah — Astra as just a kind of partner: "what do you think? How could I have a fresher look here?"

Hannah: Oh wow. I mean, that's a very clever parrot.

Greg: It's a very clever parrot.

Hannah: But then what about hardware? At the moment, as you say, it's on your smartphone — but are we talking about, eventually, glasses?

Greg: Yeah, but not only. When an earlier version of this project started, it was really trying to tease out how useful smart glasses would be if an AI was on them. In smart glasses it's the most intimate and, in some ways, amazing experience: you feel augmented, personally — like you're having a conversation with a smart version of yourself that's just sitting there and telling you whatever you want to know. But the software stack is effectively agnostic to how you use it — there are specializations for each device, but you can have it on phones or computers or VR headsets.

Hannah: I was thinking as well, actually, as we were just playing around with it, there's a potential
benefit for people who are partially sighted, or blind, too, right?

Greg: Yeah — that's an obsession of mine. We've talked about the AI being co-present, or sharing your perspective, and sometimes you want another seeing and hearing intelligence with you — but you don't always need one. So when are the cases when you want a system that can see alongside you? If you see but don't understand, or if you can't see. And that's a whole category — there are hundreds of millions of people out there who have vision impairment. And what's the gold standard of help for that population? Well, it's having someone by their side who can help them out in the world, and this technology is able to replicate that to a large extent. We have more nascent ideas too, about other kinds of disabilities. You could imagine helping people who have difficulty scanning emotions and faces — understanding that in certain circumstances.

Hannah: So people potentially with autism could use this to help?

Greg: Yeah. I wouldn't recommend it as a prescribed drug right at the moment, but I think with further development it could definitely be. Also for training yourself, since you could work on understanding faces and have Astra give you feedback — "hey, tell me about this". I remember, on a separate topic, when I was doing a homestay, learning French one summer, I couldn't pronounce certain words — like the difference between the word for street and the word for wheel, rue and roue. I still can't do it right. I sat there with my homestay brother just trying to copy him for a while, and he blew me off after a few minutes — he was like, "I'm not sitting here with you."

Hannah: Astra would be infinitely patient with you.

Greg: Yeah, it could help you with that kind of thing. Obviously, memory: we have a system that has perfect in-session memory, we call it — so while the camera is rolling, basically, it remembers the last ten minutes photographically. But it will also remember what you've talked about in the past. That's why it remembers that I'm Greg, and probably, if we turned it back on and said, "remember who was talking to you besides Greg last time?", it would remember Hannah. So this could be used for people with some cognitive impairments too. And one of the things we're excited about is this idea of proactiveness — it deciding on its own that you have a need, and then channeling the response to that need without you actually needing to give it a steer. So, for example, it could be a useful system for reminding you of things — going through the memories and saying, "oh, don't forget, you need to pick this up on your way home," or whatever.

Hannah: So you're not necessarily just switching it on when you want to talk to it — it could be there in the background and then bring something up when it thought it was appropriate?

Greg: Yeah — the idea is, you're going home and it's like, "hey, don't forget that you need to pick up some orange juice, because you ran out this morning," or whatever.

Hannah: Oh wow — because it remembers having seen that in the morning.

Greg: Yeah, exactly.

Hannah: So I guess at this stage this is painting ideas of what's possible, rather—

Greg: Yeah, we don't have that yet, but that's the kind of thing that we could build next.
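To make that proactiveness idea concrete, here is a minimal sketch in Python of how remembered observations might be scanned for actionable reminders when a context trigger fires. Everything in it — the MemoryEntry structure, the trigger phrases, the function names — is hypothetical; as Greg says, Astra does not do this yet.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryEntry:
    timestamp: datetime
    text: str  # a fact the assistant noted during an earlier session

# Hypothetical store of observations from past sessions.
memories = [
    MemoryEntry(datetime(2024, 12, 20, 8, 5), "the fridge is out of orange juice"),
    MemoryEntry(datetime(2024, 12, 20, 8, 6), "user mentioned a dentist appointment on Friday"),
]

def proactive_reminders(memories, context_event):
    """Scan remembered facts and surface any that are actionable
    for the current context (e.g. the user is heading home)."""
    triggers = {
        "heading_home": ["out of", "running low on", "need to pick up"],
    }
    hits = []
    for entry in memories:
        if any(phrase in entry.text for phrase in triggers.get(context_event, [])):
            hits.append(f"Reminder: earlier I noticed that {entry.text}.")
    return hits

if __name__ == "__main__":
    for reminder in proactive_reminders(memories, "heading_home"):
        print(reminder)  # -> Reminder: earlier I noticed that the fridge is out of orange juice.
```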
Hannah: But you can see the beginnings of it in this?

Greg: Yeah — I could easily say, "here's my fridge", and, "oh no, there's not much orange juice", and then ask, "hey, what do you think I should get at the supermarket later?", and it would remember that. But I would have to give it a bit more context — hold its hand a little bit more, as it were.

Hannah: Do you find yourself having to correct it a lot? Do you notice glitches?

Greg: Yeah. One thing it does once in a while is say it can't really see something that it can clearly see. So you'll be looking at a bookshelf and you'll say, "can you read the titles on the bookshelf?", and it'll say, "oh no, the titles — I can't make them out." And then you'll do a kind of Jedi mind trick on it — you'll say, "yes, you can see them" — and it'll go, "oh, I guess I can." That's a weird limitation of it.

Hannah: So the agreeableness is something that you can influence — it's susceptible to encouragement, then?

Greg: Yes.

Hannah: Really? I mean, hey, it works for humans too — a little bit of encouragement and suddenly you can do things you didn't think were possible. So what other kinds of environments does it struggle with? It's quite quiet in here, it's quite well lit, there's not lots of busyness going on. Does it work just as well in environments that are busy, noisy, dark perhaps?

Greg: In some ways, operating in more environments is an important thing that we need to develop — in particular, noise conditions. As I said, Astra really does hear: it actually takes in the audio directly. The neural networks take in sound and encode it as a kind of package of information that is processed directly by the language model, Gemini. But the system isn't really trained to identify different voices, so it will have trouble telling your voice from my voice when we're talking. If there are bystanders having conversations, Astra will pick that up as potentially the user's speech. It has a system that kind of wakes up and stops and listens for a bit when somebody is speaking with enough intensity, and it will just start listening to errant speech and be confused if there's nothing directed at it. So yeah, noisy environments will confuse it.

Hannah: When you say distinguish between the different voices — as in, in the waveform itself?

Greg: There's an old problem called the cocktail party problem, which is the whole problem of what's more technically known as source separation: understanding one sound source from another. If there's a guitar and someone singing, you could isolate that into two tracks — the guitar track and the singing track. Likewise, you might want to distinguish one speaker's track from another speaker's track. That might be possible to do within the single modality, or sense, of audio. It would also be possible to do it in a multimodal sense, integrating across senses — for example, when I know that it's you speaking, I can also see the movement of your lips rather than the movement of someone else's lips. So ultimately you could imagine the systems using all sorts of cues, even to change the way they perceive a sound.
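As a toy illustration of one small ingredient of that problem — deciding whether a stretch of audio belongs to the enrolled user or to a bystander by comparing voice embeddings — here is a minimal sketch. The embeddings and threshold are made up, and, as Greg says, the current system is not trained to do this; real source separation is a much harder, learned problem.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical fixed-length "voice embeddings" produced by some audio encoder.
enrolled = {
    "user":      np.array([0.9, 0.1, 0.3]),
    "bystander": np.array([0.2, 0.8, 0.5]),
}

def attribute_speaker(segment_embedding, enrolled, threshold=0.8):
    """Attribute an audio segment to the closest enrolled voice,
    or to 'unknown' if nothing is similar enough."""
    best_name, best_score = "unknown", threshold
    for name, emb in enrolled.items():
        score = cosine_similarity(segment_embedding, emb)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

print(attribute_speaker(np.array([0.85, 0.15, 0.35]), enrolled))  # -> "user"
```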
Hannah: I guess this is, in some ways, the thing that makes Project Astra so difficult, but also the thing that gives it the potential. Because the cocktail problem, as you say — humans are extremely good at it: you're at a cocktail party and you can hear exactly what the person next to you is saying, despite loads of voices going on all around.

Greg: I have trouble, actually.

Hannah: Actually, you know what — honestly, so do I; it's why it's the problem. But broadly speaking, humans are quite good at these things, and when you have the audio-only problem it's really hard to solve. But because this is multimodal — because you have video, because you have audio, because you have the text language model running in the background — you do have more levers to potentially pull here.

Greg: Yeah, I think it should be able to resolve ambiguity with more context.

Hannah: How about different languages? Is it only in English at the moment, and only with a very clear accent?

Greg: It's mostly in English for me, but no — it's very multilingual. As a function of being native audio, it knows about 20 languages with pretty high proficiency, and you can switch between languages, even in the same conversation.

Hannah: So go on, give me a little demo of different languages.

Astra: Hello, Gregory, it's nice to speak with you again.

Hannah: Oh — [unclear] — can you just hang on? I did Russian in school; I can remember one remaining Russian phrase. Can you switch languages in the middle without necessarily warning it? So could I say, for example — [speaks Russian]?

Astra: You certainly could use that phrase, but what did you wish to ask about?

Hannah: This phrase — what does it mean?

Astra: "At what time does it open?" — you are asking about the opening time of something.

Hannah: Okay — I mean, it's "what time does the shop open", I think. But it's notable, then, that you're not saying "now in English, now in French, now in Russian".

Greg: I think it's targeted to respond in the language that you started with. You actually said something in English and then gave the Russian; I think if you'd started speaking in Russian, it would have responded in Russian. As it was, it was thinking, "I'm speaking English, but I am hearing Russian," so you didn't have to change that — but if you'd just started speaking in Russian it would have been maybe a little bit better.

Hannah: But this is different, though, right, from the chatbots that we have at the moment — this is an additional capability.

Greg: I'm actually really excited about language learning with this system — walking around and being like, "what is that?", having it teach you the same way that I was taught in school, where we would bring in objects and talk about those objects in French class to learn about stuff. Being together and learning language.

Hannah: I can imagine being lost in a foreign city and that being quite a helpful aid.

Greg: Exactly — and it should be able to understand other people speaking to you quite naturally too.

Hannah: So if this is the thing that you're interacting with, what's actually going on underneath the hood? What are all the different components?

Greg: The first thing is there's an app, and that is actually gathering your video and taking in your audio through the mic and so forth. That's connecting to a server, on which there are several different kinds of neural network models.

Hannah: Like what?

Greg: There's a vision encoder and an audio encoder. There are also specialized audio systems that are just responsible for understanding when you've probably stopped speaking. Those are sitting next to
the large language model, Gemini, and they are sending information from these sensory encoders directly into Gemini, which is just responding. We worked together with some of the teams on Gemini to change the Gemini model to be better at dialogue and audio processing — we've improved its ability to take in audio and to speak. When we started working with the models, they were making lots of factual errors, so we had to identify ways in which we could improve their factuality while also being conversational; that was one aspect of our work on Gemini. On top of all that, though, is something called an agent. The agent is taking the video and audio in and sending it to the model. It's also calling search tools — either Google Lens or Google Search or Google Maps — when needed to respond to a query. So if you ask about the price of something, it will call Search. There's also a memory system that's part of the agent, and offline, in between sessions, the memory system will summarize relevant information from the session about you and about what you've talked about in that session. Those are some of the ingredients.

Hannah: I'm trying to imagine, for something that we just used to recognize a book, the number of different elements that are coming into play here. You've got the computer vision, you've got the voice recognition, you've got the large language models, you've got Google Search sitting underneath it, you've got the agent layer where you're actually making decisions — and you're doing all of that with almost no latency at all in the answers that it's giving you. This is a phenomenally complicated thing.

Greg: Phenomenally complicated, yeah. Of course, as engineers we come up with abstraction layers so that we don't have to think about all the levels of complexity at one time, but overall it's hugely complicated. The data that's going into the models is understood by very few people, and exactly why it produces the results it does is probably understood by no one, in a sense, since it's just based on benchmarks.
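A minimal sketch of the general shape of the loop Greg describes — sensory encoders feeding a multimodal model, an agent layer that can call a search tool when the model asks for one, and a running context that the memory system would later summarize. Every function and name here is an invented stand-in, not the Astra implementation.

```python
def encode_vision(frame):
    """Stand-in for a vision encoder running on the server."""
    return f"<vision:{frame}>"

def encode_audio(chunk):
    """Stand-in for an audio encoder that consumes raw audio directly."""
    return f"<audio:{chunk}>"

def model_respond(context, encoded_inputs, query):
    """Stand-in for the multimodal language model."""
    if "price" in query:
        return {"tool_call": ("search", query)}          # model decides it needs a tool
    return {"text": f"Responding to '{query}' given {encoded_inputs}."}

def call_tool(name, args):
    """Stand-in for Google Lens / Search / Maps."""
    return f"[{name} result for: {args}]"

def agent_turn(context, frame, audio_chunk, query):
    encoded = [encode_vision(frame), encode_audio(audio_chunk)]
    out = model_respond(context, encoded, query)
    if "tool_call" in out:                               # agent layer executes the tool
        name, args = out["tool_call"]
        context.append(call_tool(name, args))
        out = model_respond(context, encoded, f"answer using the {name} result")
    context.append(out["text"])                          # later summarized by the memory system
    return out["text"]

context = []
print(agent_turn(context, "paperback book", "user_audio", "what's the price of this book?"))
```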
Hannah: Let me talk a little bit about the history of this. Back in the first series of this podcast you were a guest on the very first episode, and you were drawing on inspiration from the animal kingdom for your research on intelligence — specifically, there was a bird, the western scrub jay, that you were telling us about as a way to inspire more sophisticated memory for AI. Let me just play you a little clip of it.

Greg (in the clip): Having a kind of large database of things that you've done and seen, that you can access and that you can use to then guide your goal-directed behavior later — you know, "I'm hungry, I would love to have some maggots right now, where should I go find those?" That's the kind of thing we would like to replicate.

Hannah: Have you managed to— hello, Project Astra, can you find some maggots for me? I mean, that sounds quite a lot like your orange juice example, doesn't it?

Greg: It is a proactive memory example, yeah.

Hannah: That's what you've done with Project Astra.

Greg: Yeah. I think there's a sense in which intelligence is really one thing, right? One has a career, and one is studying what intelligence is, taking glancing hits at it, trying to spar with it in one way or another — and this project is maybe the strongest unification of all of the strands of research I've had in my life. Although actually it's missing a major one, which is that it's not embodied in a physical sense — it can't act in the world — which I used to work on. But memory, perception — these have been long-standing interests, and I think this is a way of bringing them together that people seem to find stimulating; they feel connected to it.

Hannah: So how much of your neuroscience background ended up inspiring Project Astra?

Greg: Neuroscience is used in two ways. One is that there's a sense in which we're using neuroscience to know when we've done a good enough job — to think about what memory really means, and whether we've achieved it yet. It's also just a bit of a propulsion: if we want something that is compatible with us — human compatible, in some ways like us — then maybe go towards an embodiment of intelligence that's a little more like us, rather than a straightforward text interface, for example. I've been interested in the work of Michael Tomasello, who studies human communication by comparison to the great apes. He's maybe the main thinker behind this idea of — for me — situated dialogue, where he talks about the basic premise of communication as being about two individuals who are in the same place, directing attention to the same place, therefore inferring goals together, and then able to collaborate. And that was kind of what we modeled in this technology.

Hannah: So it's the inspiration at the theoretical level, rather than directly copying the design?

Greg: Not for the problem solving or the engineering per se — there I think you need to come up with different solutions that are dependent on the technology itself.

Hannah: If Project Astra links to things that we were talking about literally years ago, where did the first spark for this project come from? When did it actually begin?

Greg: I think I know that. Demis Hassabis, the CEO of DeepMind, threw down a challenge to the company, in a way, which was for us to think about what a proto artificial general intelligence would be.

Hannah: What does that mean?

Greg: A proto artificial general intelligence is a system that, if we created it and technically minded people were able to scrutinize it, investigate it, use it, experience it, they would conclude that the real deal — something generally intelligent in a computational device — was ultimately going to arrive; that it was a matter of when, not if. But what exactly it would be was left unspecified, so there was a lot of creative thinking at the time — "maybe it's this, maybe it's that", and so forth. Some people had ideas of an intelligence arising the same way that AlphaZero arose, just by interacting with the world. Other people had other ideas. But my idea was very much about the sociality of intelligence: we are not very smart as human beings unless we learn from others, or we learn from books, which is the same as learning from others. That was the idea I had for what proto-AGI would be. Then I thought we could also unify proto-AGI with the idea of a helpful assistant whose main goal is the benefit of the humans it interacts with. Maybe those two things together gave me something of a direction to look.
And then, when I tried to think about making it ultimately very natural, I moved towards thinking about video as the ultimate connective fiber of the system.

Hannah: Were there big moments along the way — big breakthroughs?

Greg: We had big breakthroughs, yeah. There were these phases of the project. The first phase was basically a hackathon: we had two weeks of making the first version, and we have a video from that time. It was quite crude, but I remember Malcolm Reynolds, who is a friend and an engineer here, playing around with Astra. He was going around an office room saying "what is this?" and the system would say "a plant". You'd say "what kind of a plant is this?" and it would say "a plant". It wasn't super flexible.

Hannah: I remember the first demo I ever saw had a seven-second latency, so you would say "hi, Project Astra" — or whatever it was called then—

Greg: —and then seven seconds later it would respond, yeah. It was very difficult to use at all, because you would think it had gone away — but it was just that seven seconds later it would come back to you. I think one of the main discoveries of the time was this. There's this idea of a prompt — a prompt is the instructions you give to the system that it needs for operation. Systems like this really understand language; they can read, and you can say things to them like "your name is Astra, you're an intelligent, helpful AI assistant". Some of that information is inherent in the Gemini models now, but some of it is indicated in our prompt. And it wasn't really understood before whether we could prompt a system that was multimodal very well. One of the mind-blowing insights, or realizations, of the time was that just telling the system that it could see the world through the user's camera gave it a sense of its own perspective on things — of the provenance of this information. It didn't understand that before: when you'd say "what do you see?" it was always giving the wrong answer. But when we said, "you're an AI that is seeing through the user's camera," then it could understand that this camera was something it was effectively seeing through, and it would answer correctly. There was a lot of work to do there, but realizing that we could effectively prompt it was maybe the key realization — even though it was a different kind of system than what we'd built before, you could use text to prompt its understanding of the situated, or more embodied, setting.

Hannah: That's so interesting.
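A small illustration of that prompting insight: the only difference between the two prompts below is the line that situates the model behind the user's camera. The wording is invented for illustration, not the actual Astra prompt.

```python
def build_system_prompt(assistant_name="Astra", situate=True):
    """Assemble a hypothetical system prompt for a multimodal assistant."""
    lines = [
        f"You are {assistant_name}, an intelligent, helpful AI assistant.",
        "Answer the user's questions conversationally.",
    ]
    if situate:
        # The one line that gives the model a sense of perspective:
        lines.append(
            "You are seeing the world through the user's camera and hearing "
            "through their microphone, so 'what you see' means whatever is "
            "currently in the camera's view."
        )
    return "\n".join(lines)

print(build_system_prompt(situate=False))
print("---")
print(build_system_prompt(situate=True))
```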
Hannah: When the gauntlet was thrown down — "create a proto-AGI" — were there people expressing doubt or skepticism that something like this might be possible?

Greg: Yeah. Hindsight is curious in AI, because it moves so fast and people's perception of what is obvious changes so fast. I think it's obvious to a lot of people now, in some ways, which blows my mind — I'm like, do you know the adversity, how much convincing had to happen?

Hannah: Well, tell us how much.

Greg: I think from many different perspectives people thought this was an odd thing to do. From the perspective of whether the systems could actually understand the world at all: the vision systems in that era, in terms of the number of pixels they were taking in, were working with something like 96-by-96 patches of image. For those who don't know, the minimum on our screens is something like a thousand pixels by a thousand pixels — so these systems were getting very blurry input. No wonder it couldn't identify what kind of a plant it was; it could barely see. Then there was the question of whether these systems would really know information about what they're seeing, rather than just being able to identify or classify it — having a deep conversation about something seemed like probably a little too far ahead. We didn't even have basic knowledge of the amount of data you need for systems to perform at various levels.

Hannah: So if all of this seemed so absurd, and yet you embarked on it anyway, were there times when you thought it wasn't going to be possible?

Greg: No, no — it always seemed possible. There were times that I was maybe willing to give up.

Hannah: Oh, really?

Greg: Yeah. There was a slow period before Gemini where things weren't working very well, and those were hard times. It didn't seem like a fruitful line of investigation at the time, for some people. But I never wavered about the fact that this was definitely possible. I had a much more obstinate, stubborn and ultimately stupid way of going about it, which was just: if I work on this for long enough, it will definitely work.

Hannah: I heard that as part of the testing phase you have this Project Astra room. What's going on in there — what's in the room?

Greg: There is a special room, yeah.

Hannah: What's inside the special room?

Greg: We have all sorts of fun and games in the special room. There's a whole bar there, so Astra can help you make a drink. There's an art gallery, so you can flash up different paintings on screens and walk around the gallery and ask questions about art.

Hannah: Okay, let's dig into some of the stuff that's going on behind the scenes of Astra a little bit more. Latency, I think, is a really key thing — you mentioned a moment ago the seven-second lag that you used to get. How have you actually improved that?

Greg: It's on multiple fronts. We've improved the actual video streaming, so it's sending information faster through the app. There's also a sense in which these systems — although they're trained together, there's a vision system, an audio system, and this language model system that's getting the information from those two things — we're what's called colocating them. That's a technical term, but basically we're always processing images: as the video is coming in to the vision system, it's always running as fast as it can, and it's sitting in the same place, in the same cluster of computers, as the large language model, so that it doesn't have to make a call across a country or a continent.

Hannah: Sorry — so they're running this kind of real-time understanding of what's going on, and you have to physically locate the computer hardware running these models close together, because that makes a difference?

Greg: Absolutely, yeah.

Hannah: Has that been the main thing, then — just moving where you're actually running the models?

Greg: No. Putting the models together is one thing. Another is making sure that we're caching the context, so that the context — the history of what the system has been doing with you — is incrementally updated over time rather than recomputed from scratch.
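A toy sketch of what incremental context caching means: keep the already-encoded prefix of the history and only encode what is new on each update. The CachedContext class and its "encode" step are purely illustrative, not how the real serving stack caches state.

```python
class CachedContext:
    def __init__(self):
        self.encoded = []        # stands in for cached model state
        self.length = 0          # how much of the history has been processed

    def _encode(self, item):
        return f"<enc:{item}>"   # placeholder for expensive encoding work

    def update(self, full_history):
        """Encode only the suffix of the history we haven't seen yet."""
        new_items = full_history[self.length:]
        self.encoded.extend(self._encode(x) for x in new_items)
        self.length = len(full_history)
        return self.encoded

ctx = CachedContext()
history = ["frame_0", "audio_0"]
ctx.update(history)              # encodes 2 items
history += ["frame_1", "audio_1"]
print(ctx.update(history))       # encodes only the 2 new items, reuses the rest
```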
Then there's this idea of doing the work with native audio. Previous systems had a speech-to-text recognition system: they'd take in the audio, produce a transcript, then call the language model, which would respond to that, and then you'd get a response. This system is getting the audio in directly, so it doesn't need that secondary system, which also takes time and can introduce extra errors. A simple effect that's possible with native audio is that it can understand rare words, or the pronunciation of words. A rare word — although it's becoming not so rare any more — is the name Demis Hassabis. The old systems, which didn't understand audio natively, often thought I was saying "Damascus". Now it knows that it's Demis Hassabis, and it can use context to resolve that — like, the CEO of DeepMind is Demis Hassabis. Another example that somebody found recently, which we have a little demo of, is distinguishing between the two pronunciations of "scone" — the same biscuity thing either way. You can ask Project Astra, "what's the difference between scone and scone?", and it will have heard that you said a different word, rather than just transcribing both into the same one. The final one is that a colleague did a lot of great work on what's called endpointing, which is a very technical term, but more or less it means it knows exactly when you have stopped speaking. It's very good at sensing, "okay, the user is really done now, so I can talk." Then there's something even more sophisticated, which is that it plans a response even if you haven't finished speaking yet.

Hannah: Oh!

Greg: It's speculatively planning — it's guessing, "this is what I would say" — and then, when it figures out that the user really has finished speaking, it just sends it right off. So it's already figured out what to say before maybe even you really know that you're done speaking.

Hannah: That is so interesting, because I guess a lot of the time the important bit of people's sentences can be in the middle, and then they sort of trail off towards the end — and you can use that time for getting ready with your answer.

Greg: Pretty much, yeah. We actually talked about that stuff three years ago, and then it seemed like, oh, that's too much — and then it started to work this year.

Hannah: Pre-emptively guessing what the answer is going to be before the conversation has got to that point.

Greg: It's hard, you know — we pause for a long time in our sentences. So the system we have actually has to use some "semantic" understanding — it has a bit of understanding of the context, and of the sounds it's hearing — to guess when the user is probably done.
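A minimal sketch of those two ideas together — an endpointing check plus speculative drafting of a reply while the user is still speaking. The rules here are toy stand-ins; in the system Greg describes, endpointing is learned from audio and semantics rather than hand-written.

```python
import re

def probably_finished(transcript_so_far, silence_ms):
    """Toy endpoint heuristic: enough silence plus an utterance that looks
    semantically complete. Real endpointing would be a learned model."""
    looks_complete = bool(re.search(r"[.?!]\s*$", transcript_so_far)) or \
                     len(transcript_so_far.split()) > 6
    return silence_ms > 400 and looks_complete

def draft_reply(transcript_so_far):
    return f"(draft) Here's my answer to: '{transcript_so_far.strip()}'"

def converse(events):
    """events: snapshots of (partial transcript, trailing silence in ms) over time."""
    speculative = None
    for transcript, silence in events:
        speculative = draft_reply(transcript)       # plan while the user is still speaking
        if probably_finished(transcript, silence):  # user is very likely done: send it
            return speculative.replace("(draft) ", "")
    return speculative

events = [
    ("What time does", 80),
    ("What time does the bakery open", 120),
    ("What time does the bakery open tomorrow?", 600),
]
print(converse(events))
```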
Hannah: But also the reasoning that it's doing — even separate from reasoning about whether you've finished a sentence or not, do you think that Project Astra is capable of reasoning?

Greg: Yeah. It's primarily reasoning through its internal structure, inside the neural network, in an unobservable or very complex way. Then there's the dialogue itself that it's producing, so it sometimes reasons through the dialogue — you can kind of hear it sounding out an answer. People are also developing systems that have inner speech, effectively, where they're talking to themselves without talking to you; Project Astra at the moment doesn't do much of that.

Hannah: But then I guess the advances that happen in the reasoning models need not be distinct from what happens in Project Astra — I guess the whole point of this is that it's pulling in everything, so that you have this ultimate proto-AGI, as you called it.

Greg: Yeah, and in some ways I actually hope that it motivates some more vigorous work on some aspects of reasoning. We have this great example: Bibo Xu, the product manager on Project Astra, pulled out Astra at lunch one day and asked, "how many calories are on my plate?" She had a very complex, very beautifully laid out plate, with about six types of food — some almonds in the middle, a pork loin over there, some Brussels sprouts or whatever — and it kind of waffled a little bit. But then she said, "keep a running total — how many are in these Brussels sprouts?", and it said, "well, that's seven Brussels sprouts, therefore it's this many calories" — and then, "okay, now add the pork loin". One of the things that was quite notable to me was that Bibo was hand-holding its thinking — as you said, it needs a little guidance sometimes. But I don't think we're very far off from a system that itself would just say, "well, I see there are seven almonds over there, this many Brussels sprouts, there's a pork loin — all those together come to such and such." In some sense it's not good at that stuff because we just haven't ever tried to build a system that could reason about that stuff.

Hannah: Now I want to talk to you a bit more about memory — on that point about the things that it's recalling and keeping in its mind, as it were, if you'll forgive the anthropomorphism. I know that back at Google I/O this could remember what had happened in the last 45 seconds, and now you've increased that — you can do ten minutes now, right?

Greg: Yeah, it's about ten minutes. Actually it's a little bit longer in some ways, but ten minutes is maybe what it should do.

Hannah: What makes ten minutes the limit?

Greg: It's got basically a raw record of the last ten minutes of video. It works at about one frame per second, so it's got a stack of all the frames over time, and all the audio that came in between those frames, for the last 600 frames or so. The limits are really about the memory on the chips — that sort of fast, active memory hasn't scaled very much in the last decade or so.

Hannah: So at the moment it's effectively acting like a video recorder, keeping an actual record of everything that's happened in the previous ten minutes?

Greg: Yeah — and it's quite active; it's able to use that information right away.
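A minimal sketch of that kind of rolling in-session buffer — roughly one frame per second, capped at about 600 entries so that older material falls away automatically. The capture functions are placeholders; this illustrates the data structure Greg describes, not Astra's actual memory.

```python
from collections import deque
import time

MAX_FRAMES = 600  # ~10 minutes at ~1 frame per second

session_buffer = deque(maxlen=MAX_FRAMES)  # oldest entries drop off automatically

def capture_frame():
    return f"frame@{time.time():.0f}"      # placeholder for a camera frame

def capture_audio_since_last_frame():
    return "audio_chunk"                   # placeholder for ~1s of audio

def tick():
    """Called about once per second while the session is live."""
    session_buffer.append((capture_frame(), capture_audio_since_last_frame()))

for _ in range(5):
    tick()
print(len(session_buffer), "entries buffered (capped at", MAX_FRAMES, ")")
```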
There's also a secondary system: when you turn the system off, it will take that conversation, summarize it, and pull out relevant facts.

Hannah: The most important bits — and it uses its own discretion to figure out what those are? It extracts the gist of it, as it were?

Greg: Yeah. One thing it can do at the moment is recall important things from recent interactions. It's got a sort of two-stream memory. There's a memory that's about you as a person — a developing understanding of you. It's effectively taking notes: "oh, they like ice cream — chocolate ice cream," whatever. That becomes a kind of list of things it has discovered about you, and it's updated after every session too. So suppose you say, "you know what, I've actually decided I don't like ice cream any more, I really like cake, so forget that I liked ice cream" — it will then note that the user says they no longer like ice cream and they like cake. Those things are a kind of stationary, or static, understanding of who you are — your preferences. Then there's also this kind of conversational summary, like, "on Tuesday at 8:50 we talked about this game of chess."

Hannah: But how does it decide which bit goes in which — how does it decide what's important enough to be a thing about you that it remembers?

Greg: It's got heuristics — these systems are actually given heuristics. A heuristic is basically a rule of thumb for what to remember. One heuristic it uses — we've told it to — is that if you ask it to remember something, it should definitely remember that. That's a pretty clear one: if I say "remember my door code", it will do that, because it understands that's an instruction of relevance. Otherwise it takes a best guess — it's asking, has the user expressed any preferences that are interesting, or different from the ones the user has already expressed? — and then we update based on that.

Hannah: Well, let's talk about some of the privacy concerns here, then. How do you mitigate against some of those?

Greg: I think one of the major standards is that of consent. The users have access to their previously recorded data, and they can delete it, or see what is stored. Every time you delete something, it reconstitutes its total knowledge of you.

Hannah: Oh — it goes through the whole process of summarizing the things it knows about you anew. So the answer, I guess, is that the user ends up having some control over what it knows about them.
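Putting those pieces together as a sketch: a hypothetical two-stream store with per-session summaries, a list of user facts, the "remember ..." heuristic Greg mentions, and a delete operation that rebuilds everything from the sessions that remain — roughly what "reconstituting its total knowledge of you" would mean. All names and logic here are illustrative, not the Astra implementation.

```python
from datetime import datetime

class LongTermMemory:
    def __init__(self, summarise):
        self.summarise = summarise
        self.raw_sessions = []       # the source the streams are rebuilt from
        self.user_facts = []         # e.g. "my door code is 4417"
        self.session_summaries = []  # e.g. "(Tue 20:50) session with 2 turns"

    def _ingest(self, transcript):
        self.session_summaries.append(self.summarise(transcript))
        for line in transcript:
            # Heuristic: an explicit "remember ..." request is always kept.
            if line.lower().startswith("remember "):
                self.user_facts.append(line[len("remember "):])

    def end_of_session(self, transcript):
        self.raw_sessions.append(transcript)
        self._ingest(transcript)

    def delete_session(self, index):
        """User-controlled deletion: drop the session, then rebuild what the
        assistant knows about the user from the sessions that remain."""
        del self.raw_sessions[index]
        self.user_facts, self.session_summaries = [], []
        for transcript in self.raw_sessions:
            self._ingest(transcript)

def toy_summarise(transcript):
    return f"({datetime.now():%a %H:%M}) session with {len(transcript)} turns"

memory = LongTermMemory(toy_summarise)
memory.end_of_session(["remember my door code is 4417", "I like cake now"])
print(memory.user_facts)    # ['my door code is 4417']
memory.delete_session(0)
print(memory.user_facts)    # [] — knowledge reconstituted without the deleted session
```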
Actually — in this podcast, a few episodes ago, we got to talk to Iason Gabriel, who's this ethicist at DeepMind — he's amazing — and he was telling us about the ethics of AI assistants and how they should be shaped to take into account lots of these difficult questions. How much has his work fed into what you've come up with with Astra?

Greg: We just fed his 243-page report into Astra, and Astra said, "okay, I've got it."

Hannah: Did you?

Greg: No — I wish. We've spoken a lot with Iason, and we've done a lot of work with a team that he's part of. They've been investigating both the model and the agent as a whole, exploring what it might do in different circumstances, and also working with some external red-teamers, who maybe have fewer preconceptions and might try more different kinds of adversarial attacks on the system. We also have a layer of safety filters — this is for user harms — so, for example, if you say certain things to it, or show it pornography, it will trigger these filters and not respond. It will also trigger on its own speech, so it can't say certain things — although the filters trigger very infrequently anyway. I think the range of issues is quite broad; fortunately we still have some time to figure stuff out.

Hannah: So what are your next priorities over the next few months? What are the main things you're going to be working on?

Greg: I'm very interested in something called proactive video work — that is to say, a system that can not only respond when you speak but can also help you in an ongoing sense. For example, that's part of the visual-interpreter-for-the-blind problem: you're walking around, you can't see, and it will say, "watch out for the table over there" — it can guide you in an ongoing sense. We're also doing a lot of work on more audio output — what's called full duplex — so it will hear and speak at the same time. That could potentially be annoying — it could interrupt you — but it's also more natural conversation: as you're talking I might say "uh-huh", and that's listening and talking at the same time, which is part of language, to confirm. Then more on reasoning, as you said; deeper kinds of memory; reflection of certain kinds; and, when it calls tools, being able to do deeper inquiries and research with those tools. There are just so many things to do better.

Hannah: Well, thank you very much for joining us, Greg.

Greg: Thank you, Hannah.

Hannah: It is strange how quickly our expectations change about AI. I don't know if you remember what Oriol said in our last episode — he said that if someone had told him five years ago the things that would be possible, he would have thought we were already on the path to AGI. And yet here we have this prototype of a multimodal agent: one that can see, that can hear, that has memory and context and reasoning and multilingual, real-time conversation. This is an agent that could, at least in theory, accompany you in your day-to-day — enhancing your knowledge, supporting people with disabilities and augmenting our skills. Now, of course, AGI it isn't, but it definitely feels like we are a significant leap from the kinds of systems that we were talking about even two years ago. Thank you so much for joining us for this series of Google DeepMind: The Podcast. We are going to take a break from here, but if you want to catch up on any of our previous episodes, there is a whole array of deliciously nerdy AI conversational delights in our back catalog for you to enjoy — just find them on YouTube or wherever you get your podcasts.

[Music]