The future according to OrCam


Thank you very much, Will. I'm so happy to be here today, and to be talking to Professor Shashua about OrCam and its various products. Like Professor Shashua, I think there's an enormous opportunity in this industry to leverage AI and computer vision to enable better processes and better accessibility for people with disabilities, people with vision loss, and people with hearing loss. So I'm excited to talk about this today, and thank you very much for being here, Professor.

Thank you, Matthew, it's a pleasure.

There are a lot of ways to start this conversation, but a basic one is: why do you see AI as such an enormous opportunity to deliver significant advances for people with disabilities who could benefit from it?

When you look at AI, it is playing out in two major domains. The first is machine vision: the ability to see and interpret the visual data coming into the machine. The last five years have brought huge leaps in the ability of machines to interpret visual data and to reach performance levels that rival human perception. The second is understanding acoustics and sound, going from voice to text and understanding that text, which is called natural language processing, or natural language understanding. That area has also seen huge leaps in the past five years.

Now put both of those together, and add to the mix the rise of compute. Silicon is becoming more and more dense, and power consumption is getting lower and lower, so workloads that are very compute-intensive can now run on a battery. Our smartphone today is like a supercomputer of ten years ago, and it sits in our pocket. Put all of these things together, and it is as if the stars are lining up perfectly.

You start seeing huge value, especially for people who have disabilities. If you have difficulty seeing, the machine can see for you and then whisper into your ear what you are about to see. You can also communicate with the machine by talking to it, the way we talk with Cortana, Alexa, and Siri. So if the machine is looking at the visual field and you want to know what's out there, rather than the machine telling you everything it sees and giving you too much information, you can start asking questions: what's in front of me, guide me to that place; if I'm looking at a newspaper, read me the headlines, then read me article number three, or read me an article about Biden. You can start exploring the visual world, because there is also a natural language processing interface.

If you put those things together and put them on a wearable device, then it is as if you are emulating a human assistant. It's as if I have someone smart enough, with proper hearing and proper sight, standing beside me, and I'm communicating with this person, and this person is telling me about the world: what I'm seeing, whom I'm seeing. I can ask this person, tell me when you see Matthew, as I'm going from place to place, because I don't see. Or I can tell him, guide me to the next empty seat, I'm getting on a bus, and this human assistant will tell me, here is an empty seat. Now replace this human assistant with a computer, and you can start appreciating the big value you can get. And it all sits on a very, very small device, because of this rise of compute, the rise of data, the rise of machine learning, deep networks, natural language processing, natural language understanding, and computer vision, all coming to a climax in the last five years. That makes it a huge opportunity for helping people.

Yeah, and there are certainly a lot of components there. You have the advances in natural language processing, but also semantic understanding of the world, as you mentioned. It's one thing to say, hey, there's text in the field, I'm going to automatically read that, and I believe we've seen some applications of that across VoiceOver and other technologies. But there's an advancement here in the way it segments the world semantically: to understand, for instance, that if a paper is being held up close to my face, I want to read that, and I don't want to read a sign in the distance. That's one of the advancements that I believe has been added to the OrCam camera recently, right?

So, when we started with the OrCam camera: OrCam was founded in 2010, which is at least four years before the rise of deep networks. But we felt that computer vision was at the point where we could do something useful. We had done something useful at Mobileye with perception-based driving assist, where the camera can do very useful things, avoiding accidents by detecting people and detecting cars.
We thought we could do the same thing here and focus on recognizing products, recognizing banknotes, recognizing faces (faces came a bit later), but mostly recognizing text: the ability to do OCR, optical character recognition, in real time. In the year 2010, that was groundbreaking. Then, as we moved forward, the first thing we asked ourselves was how the user would communicate with the device. In 2010, the way you communicated with the device was hand gestures. If I wanted to trigger the device to give me information, I would point with my finger, and if I wanted the device to stop whatever it was doing, I would make a gesture like this. So, hand gestures.

Then later, as we moved forward and technology progressed, as compute and silicon became more and more dense, algorithms became much more powerful, and data became much more available, we said, okay, we can use natural language processing and natural language understanding to get a good interface between the human and the machine. So now I'm looking at a newspaper. In 2010 the device would read the entire newspaper; now I can say, read me the headlines, and then, read me article number three. Or I can tell the device, detect room number 23, so as I'm walking along a corridor looking for room number 23, it will stop me when it sees the number 23. I'm communicating with the device the way I would communicate with a human assistant. Or say I'm looking at a menu. In 2010 it would read the entire menu; now I can say, find desserts, and it will read me only the desserts, or, start from fish, and it will read starting from the fish. So first, computer vision became much more powerful, so we can do much more than we did back in 2010, but also the combination with natural language processing made it very, very powerful.
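As a rough illustration of the kind of command-to-action mapping described here, a minimal sketch follows. The names and structures in it are hypothetical, chosen only for the example, and do not come from OrCam.

```python
# Purely illustrative: a toy dispatcher mapping spoken commands to actions
# over OCR'd text. All names here are hypothetical, not OrCam's API.
from dataclasses import dataclass

@dataclass
class Block:
    kind: str   # e.g. "headline", "article", "menu_item"
    text: str

def handle_command(command: str, blocks: list[Block]) -> list[str]:
    """Return the snippets of OCR'd text to read aloud for a spoken command."""
    command = command.lower()
    if "headline" in command:
        return [b.text for b in blocks if b.kind == "headline"]
    if "article number" in command:
        n = int(command.rsplit(" ", 1)[-1])           # e.g. "read me article number 3"
        articles = [b for b in blocks if b.kind == "article"]
        return [articles[n - 1].text] if 0 < n <= len(articles) else []
    if "dessert" in command:
        return [b.text for b in blocks if "dessert" in b.text.lower()]
    return [b.text for b in blocks]                    # 2010-style fallback: read everything

page = [Block("headline", "Local news"), Block("article", "Article one ..."),
        Block("article", "Article two ..."), Block("menu_item", "Dessert: cake")]
print(handle_command("read me the headlines", page))  # -> ['Local news']
```

The contrast with the 2010-era behavior is the fallback branch: without a language interface, the only option is to read everything.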
Yes, and I think there's a huge key here, which you've touched on a little here and there but which is important to talk about head-on: all of this is being done via on-device processing with OrCam. I believe that's huge for a lot of reasons. One, it's technologically impressive. But second, it is sometimes underappreciated how intimate accessibility devices can become to the user. Users are letting them into their lives, asking them very personal questions, opening up their world in a visual and auditory sense, with acoustic filtering and acoustic recognition. All of that raises privacy questions. If we're basically saying this device is an extension of my being, then privacy, and the intimacy and closeness of that data, becomes massively important. Many of the other accessibility devices on the market rely on cloud connectivity, with maybe a single feature here or there that runs locally on the device. So can you talk a little bit about the road to ensuring that you could do all of this on-device, on a battery-powered unit? Because I think that's one of the most singularly impressive things about OrCam.

I think that early on we understood that this is the most critical challenge we needed to face. If you want a device that sees everything you see and hears everything you hear, and processes it, then if you send all of that data to the cloud, you run into huge privacy issues. About two years ago we did test the waters with a device focused only on face recognition. We ran a Kickstarter and shipped about a thousand devices focused only on face detection, where the information was sent to the cloud, because in the cloud you have all your databases of faces and so forth. It hit huge privacy concerns.

So we stopped with those thousand devices we had shipped, because processing private information in the cloud is very, very problematic, especially when we're talking about visual and auditory information. The intelligence has to be at the edge; it has to be wearable. And that raises lots and lots of challenges. One is power consumption. A second is processing power: in the cloud you have unlimited processing power, while here you have to work with a very small amount of processing power, because of power consumption and because of size. There are also practical considerations. Even if I wanted to send something to the cloud, sending high-resolution images to the cloud would not let the device work in real time, as it should. The fact that everything is done on the device allows real-time processing and provides much more value to the user. So being able to cram everything into a device of this size (and this is the device for the visually impaired), and have it work for a number of hours at full capacity while processing audio and vision, was a huge, huge challenge that OrCam had to face early on. And it pays back.

People feel that there are no privacy issues. They can trust the device, because nothing is sent to the cloud; there is no communication with the cloud whatsoever.

Yeah, and I think that trust is important when you're saying, hey, we're producing a device for you that enters your personal universe. Technologically speaking, computer vision is somewhat better understood by a larger percentage of the populace than it was in, say, 2010. People are starting to see applications of it in their daily lives, from the iPhone's camera to Google Photos, which automatically selects faces and themes for people, and visual search in photo libraries. And of course the Microsoft Kinect had a lot to do with exposing that to a mass audience. But that's been much less the case so far with auditory signals. So could you talk a little bit about OrCam Hear, about the applications there and how it was developed?

When you look at processing acoustics, we took it in two directions. One was the OrCam MyEye for the visually impaired: being able to communicate with the device by voice, as I mentioned before. The second was creating a new product line that will help people who have hearing loss, even mild hearing loss. The idea is that there is an open problem called the cocktail party problem; it has been open for about five decades in academic circles. When you are in a situation in which many people are talking at the same time, and you are having a discussion with one specific individual, you tune in to what that person is saying and tune out everything else. It is somewhat miraculous how you do that: you follow the lip movements and do something quite remarkable. In the past few years, with the rise of deep networks and the ability to build architectures that process acoustics and video together, one can do things that ten years ago would have been considered science fiction, and solving the cocktail party problem is one of them.

The idea is that you have a black box, which is a network. It receives the acoustics, and it receives the video of the person you are looking at. The acoustics include the voice of the person you are talking with, but also the other people talking: you're at a cocktail party, or in a restaurant having a discussion with the person in front of you, with background noise of dishes clattering, and at the table beside you other people are talking as well. All you want is to hear the person in front of you and tune out everything else. So this network receives all of this complicated acoustics, receives the video of the person you are looking at, somehow follows the lip movements, extracts from the acoustic wave only the voice of the person you are talking with, and transfers that to your hearing aid. It could be a hearing aid, or it could be just an earphone; it could be that you have mild hearing loss and don't need hearing aids.
But you would like, in this particular setting, a complicated acoustic environment like a restaurant, to put an Apple AirPod in your ear and amplify only the voice of the person in front of you, tuning out everything else. And that is something that, from a technological point of view, simply was not possible to do before.
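To make the "black box" described above a little more concrete, here is a minimal, generic sketch of an audio-visual speech-separation model in PyTorch: one encoder for the noisy mixture, one for the target speaker's face, fused over time to predict a mask that keeps only that speaker's voice. This is an illustration of the general technique, not OrCam's actual architecture; every module name, dimension, and design choice here is an assumption.

```python
# Generic sketch of audio-visual speech separation (NOT OrCam's actual model).
# The network sees a noisy mixture plus video of the target speaker's face,
# and predicts a mask that keeps only that speaker's voice.
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    def __init__(self, n_freq=257, video_feat=512, hidden=256):
        super().__init__()
        # Audio branch: per-frame spectrogram features
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        # Video branch: per-frame lip/face embeddings (assumed precomputed)
        self.video_enc = nn.Sequential(nn.Linear(video_feat, hidden), nn.ReLU())
        # Fuse the two streams over time, then predict a mask for the target voice
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, face_feats):
        # mix_spec:   (batch, time, n_freq)    magnitude spectrogram of the mixture
        # face_feats: (batch, time, video_feat) visual features, aligned in time
        a = self.audio_enc(mix_spec)
        v = self.video_enc(face_feats)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        # mask * mixture ~= the target speaker's spectrogram
        return self.mask(fused) * mix_spec

# Tiny smoke test with random tensors
model = AVSeparator()
out = model(torch.randn(2, 100, 257), torch.randn(2, 100, 512))
print(out.shape)  # torch.Size([2, 100, 257])
```

The point the sketch captures is the fusion step: the visual stream tells the network which of the overlapping voices in the mixture to keep.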

We capitalized on several things that OrCam was very good at. One is taking very advanced technology, advanced in terms of computer vision, natural language processing, and deep networks, and putting it on a very, very small device. The other is taking our vast experience in computer vision and in natural language processing and understanding, which we built for the visually impaired, and building the architectures that can solve the cocktail party problem. And this is the device, OrCam Hear. The idea is that you simply put it on your neck, and the camera faces the person you are talking with. In this case it pairs with just a normal earphone: I put in a normal earphone and then I can tune in to the discussion with the person in front of me. What the device does is this: the camera here, together with a number of microphones, takes the complex acoustics and the video of the person in front of you and feeds them into that somewhat miraculous black box, the neural network, and out of the neural network comes a new acoustic wave, which tunes out everything except the voice of the person you are talking with.

How do you go about training a network like that?

That was one of the big challenges. How exactly we train such a network is one of our trade secrets. But what I will say is that one of the big advancements of the past two or three years is called self-supervision. It is unlike classical machine learning, where data needs to be tagged: say I want to recognize cars in an image, then I go through a process of collecting training data, tag every vehicle in that collection of images, and feed it into a neural network, telling it, here is an image as input and this is a car, I'm tagging it. That requires a lot of effort and limits scalability. The past two or three years in machine learning have shown a lot of progress in self-supervision, where you don't need to tag anything, you don't need to label anything. It started with language processing and then shifted to computer vision: unsupervised learning, or self-supervision. We harnessed this new advancement so that we can take, essentially, the entire internet of clips, wherever clips are, and use them for training this neural network, without tagging or labeling anything. Being able to train such a network is really a big part of the technological ability of the company. It's not only putting everything on a small device; it's being able to train it.
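One well-known way to get training targets "for free" from unlabeled clips is the mix-and-separate idea from the audio-visual separation literature: mix the audio of two clips, then ask the network to recover each clip's original audio using its video. The sketch below illustrates that general principle only; it is not a description of OrCam's training recipe, and it reuses the hypothetical AVSeparator model from the previous example.

```python
# Illustration of a self-supervised "mix-and-separate" training step
# (a common trick in the literature; not OrCam's actual recipe).
# Two unlabeled clips are mixed; the clean audio of each clip then serves
# as its own training target, with no human labeling involved.
import torch
import torch.nn.functional as F

def mix_and_separate_step(model, clip_a, clip_b, optimizer):
    """clip_a / clip_b: dicts with 'spec' (B, T, F) and 'face' (B, T, D) tensors."""
    mixture = clip_a["spec"] + clip_b["spec"]      # synthetic cocktail party
    est_a = model(mixture, clip_a["face"])         # separate speaker A using A's video
    est_b = model(mixture, clip_b["face"])         # separate speaker B using B's video
    # The original, unmixed spectrograms are the targets; no labels needed.
    loss = F.l1_loss(est_a, clip_a["spec"]) + F.l1_loss(est_b, clip_b["spec"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch with the AVSeparator from the previous example:
# model = AVSeparator()
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = mix_and_separate_step(model, batch_a, batch_b, opt)
```

Because the "labels" are just the clips' own unmixed audio, any collection of talking-head video can serve as training data, which is what makes this kind of approach scale.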
Right. And to talk about hardware for a second, I think it would be interesting to hear your view. With your other company, Mobileye, and now with OrCam, you're dealing with on-device processing that uses specialized chips, and I'm curious how you view the whole industry of specialized chips as it relates to AI. The biggest example we've seen is Apple launching a dedicated co-processor for running ML more rapidly and more efficiently, but of course NVIDIA is a big player in this space, and Intel is a big player in this space now. I'm curious what your thoughts are on the industry at large. The advancement of specialized chips seems to be driving a lot of the innovation around specific devices, OrCam's being one example, that take on tasks requiring very specific operations, versus the generalized-processor universe of ten years ago.

Well, it's a spectrum, and you need to place yourself optimally on that spectrum. On one end you have general-purpose processors, like a CPU, and ease of programming: it's very, very easy to program. That would be the CPU, or the NVIDIA GPU with the CUDA programming language; it's very natural to write code for those processors. On the other end of the spectrum you have architectures that are very hard to program, very specialized, and very efficient in a very narrow domain. Such a chip will be very efficient at running a specific type of neural network, much more efficient than a general-purpose processor. The problem is that it is very difficult to program, so it is difficult to scale, and also, because it is not general-purpose, you are exposed when technology evolves, and technology is evolving rapidly; every month you see a new architecture coming in. Take, for example, the period from 2014 to 2017 or so: convolutional networks were essentially the definition of deep networks. It was all about pattern recognition and computer vision, all sorts of network architectures built around convolutions. Then around 2017 came the rise of language processing, and a new type of architecture came in, transformers, with models like BERT. They came from Google first, and from OpenAI, and today there are thousands of academic papers around those new architectures. So if you have silicon that is very specialized for a certain architecture, and technology is moving very fast, with software moving much faster than silicon, you can be in trouble. You really need to find a point on this spectrum where you are not too specialized and not too general-purpose: not so difficult to program, but on the other hand not very easy to program either. Finding the right spot on this spectrum is the holy grail.

So you're not pushing a boulder uphill with an architecture that's too narrow, but at the same time you don't want to lose the efficiencies over general-purpose processing.

Exactly, exactly. For example, Mobileye also specializes in building chips, and that silicon, the system-on-chip, has multiple different architectures: some of them are very specialized for neural networks, others are less specialized. The point is to find a good position on this spectrum of ease of programmability versus compute density, such that you can move with the tide rather than being locked into a very specific architecture.

And I'm curious, too: nothing is ever complete, no journey is ever complete, but what would you view as a success point, or a major end game, for OrCam in general? What are you looking to do?

Well, the end game for OrCam is what we call AI as a companion. If you look at our smartphone, the smartphone is a supercomputer, but the problem is that it's in our pocket, so it doesn't see and doesn't hear. If instead you have a device that sees and hears and has the computing capability of a strong machine, and this device is aware of everything we are doing, shares all of our experiences throughout the day, the visual experiences and the audible experiences, and has intelligence, then it can create huge value. The challenge is to gradually define what that value is. So what OrCam set out to do is to take society and peel it into layers.
For each layer we can define a very precise value, and then move forward. We started with the blind and visually impaired, because the value there is evident; it's very clear: you cannot see, the device can see and will tell you what it sees, so this is a great help. Then we moved to the hearing impaired, with the cocktail party problem, creating a much stronger experience in terms of hearing help than normal hearing aids provide. But then you can go much further than that. Since the device is aware of the people you meet and the discussions you have with them, say we're talking and I mention, okay, let's schedule lunch next week, the AI can handle this automatically, without even needing to be told. Next week it can manage my calendar, put a meeting between you and me on it, and also communicate with your calendar, with your AI, and do this seamlessly, because it was observing our discussion. It knew who you are, because of face recognition; it has natural language understanding, so it knows what we are talking about and understands our intentions. So the challenge here is to define this value.

And it's difficult to define a value proposition that is good for everyone; that's why we're doing this in stages. But we believe that as the AI progresses, as compute density increases and power consumption decreases so that such a device can work for a full day, we can gradually define the value proposition going forward, until we enable a value that is good for everyone, not only for people with disabilities but for everyone. That is really the end game for OrCam: wearable AI.

Yeah, that makes sense. One example that we haven't talked about yet is OrCam Read. The Read device is not necessarily for people who are visually impaired, but it is described as aiding people who have reading issues, like dyslexia, for instance. I'm curious why. It seems to fit into the layering you were talking about, building out use cases in tranches, but I'm curious why that detour from the visual and acoustic paths of the other devices.

It's not a detour; it's really a natural extension of the MyEye. The MyEye sits on eyeglasses, and if you are blind or visually impaired, putting something on eyeglasses is natural. But if you are dyslexic, or have age-related difficulties in reading, you don't necessarily want eyeglasses; maybe you want something handheld. You point it, press the button, it takes a picture of the text, and then you can also communicate with it by voice: read me the headline, just as with the MyEye, or read me the entire text, or find the word Facebook and read me the text around it, or start from the desserts in the menu. Whatever we do with the MyEye, you can do with this handheld device. And then you open it up to new parts of society: not necessarily people who are visually impaired, but people who have difficulty processing reading because of dyslexia or because of age-related conditions. There are all sorts of syndromes where, over time, people find it difficult to read text and would like help in reading it, and this simply opens up the market. What we found out is that even people who are blind and visually impaired sometimes prefer a handheld device over a device that clips onto eyeglasses, which is something we did not anticipate, so the OrCam Read is also sometimes shipped to blind and visually impaired users.

Got it. And it seems like there are other applications there. Obviously English is the first target language, but it seems like there are also opportunities for translation, right? If it recognizes text in one language, it could translate it into another.

Yes, but today machine translation running in the cloud is so powerful that it would be much easier to decipher the text on the device, send it to the cloud, translate it there, and send it back. It doesn't seem to be a high priority to put all of that huge machine-translation technology on a very, very small device.
In that case, because there is no privacy issue, you simply send the text to the cloud, translate it, and send it back.

Makes sense, makes sense; that would be a much better route. Excellent. Well, thank you so much. I really appreciate you taking this time to talk us through OrCam's offerings. I think this vision for specialized AI, in these instances, makes a lot of sense.

Thank you, Matthew.

Have a good day. Thank you. Back to you, Will.
