- [Announcer] Welcome to "What That Means with Camille". In this series, Camille asks top technical experts to explain, in plain English, commonly used terms in their field. Here is Camille Morhardt. - Hi, and welcome to the podcast InTechnology. Today we're gonna cover "What That Means: Synthetic Data".
I have with me Selva Panneer and Omesh Tickoo, both principal engineers in Intel Labs. Welcome to the show. - Thank you. - Thank you, Camille. - Can we start by having one of you explain what synthetic data is? I'm not sure everybody has heard the term. - There are two kinds of synthetic data, one that is generated using programming models and another one generated using AI. And both follow some rules when generating the data.
So it's not random data; there are ground rules around how the data is generated. And this data is used in many ways. - So to be clear, this is fake data. I understand there are rules so that we're trying to mimic reality, or realistic conditions, but it actually is fake.
We're generating it, making it up. - That's right. Yeah, they're generated. - What scenarios or what industries is this occurring in and why? Why are we doing it? - It's used across many industries. So if you look at today's AI systems, they are both consumers of data as well as producers of data. And we can have fake data used to build models.
For example, our AI models are very data-hungry, and we don't expect to find that data every time we want to build a new AI model, so people are using fake data, or synthetic data, to train these models. They are also generating this data as an end product: for example, if I'm building content for a media house, I may want to picture scenarios that may not be realistically possible but have to be as close to realism as possible. So there again, it's generated synthetic data that's being used, as opposed to synthetic data that was used to build an AI model. So in terms of industries, anywhere we need a lot of data, whether it's industrial applications, autonomous systems like cars or robots, or the media and gaming industries where we wanna use the data for different purposes, that's where we see a lot of usage right now of synthetic data. - Most factories do pretty well.
Most of the products come out looking good, and so we have a lot of data of how it's supposed to look, but we don't do so well when there's a problem. So is that the kind of scenario where you're generating synthetic data so you can train a model to notice when something is wrong? - Defects can come in many different sizes, shapes, and forms, and you really cannot train for every single kind of defect because you really don't know all the defects that you will see in the real world. So there are two ways to approach this. One is to say that I know all my good data, and I'm gonna use that information to figure out what's bad. But that only allows you to tell bad from good. It doesn't allow you to, for example, say whether there's a crack, whether there's a break, whether there's somebody who forgot to do something.
If you wanna get to that level, synthetic data really helps because you can actually produce those kinds of permutations and combinations, and train your model so that those things can be detected in the real world without actually needing them to happen when you train your model initially. - So the second scenario I wanna ask about is, Selva, I know you've been working in film, so can you describe how we're generating it, again, we as the industry more generally, and how it's being used in the movies? - In movies, for example in visual effects, we need a lot of compute power today, and that compute power is needed to render things that look more realistic. And this is an area where the need for compute is much higher. And now we are using AI to see if we can actually render things much faster.
So these AI models need a lot of data to train on. And so, in this scenario, synthetic data has been used to train the AI models that can be used in rendering, to reduce the render time. And this is a big part of the movie industry, where they can actually render much faster using AI trained with synthetic data. - So what are you rendering? A person, a background? What are we looking at? - It could be anything. It could be the background, it could be the person. I mean, when it comes to the person, it's much higher too.
And we need a lot of data. And this is where the lack of data comes in: we can't really use real people in these data sets, so we need to actually generate that data using computer graphics, and use that data again to train AI models so that they can actually generate a digital twin of somebody, or even characters in a game or in movies. - So how is a digital twin being used in the movies? - AI is trying to bring two worlds together.
One is the computer vision world, where computers are used to see how our real world is, and AI is being used there to perfect that. On the other side, there is the computer graphics world. We're basically trying to create this realistic world, and AI is again used there to bring this realism into movies and games. So similarly, for creating a digital twin, we need to have these two worlds come together.
And AI is the glue that is making this virtual world and real world come together. And that requires a tremendous amount of data to train these AI models to blur the lines between the real world and the virtual world. And this brings a lot of innovation to the movie industry, the gaming industry, and all the futuristic metaverse that we're all heading towards. - I know it's being used in stunts. So where there were stunt people who were putting their lives in danger, this is one of the areas where it's being used to mimic what the main character might look like. - Everything is done manually today, which takes a lot of time and energy to create those stunt sequences.
And this is where synthetic data can really help. With the advancements in computer graphics and rendering, we can now render people so that, looking at two people, we can't really tell whether one is real and the other is rendered or synthesized using AI. So the lines are really blurring, which means that now VFX artists can take a rendering as a base, and use AI to make it look realistic in games or in movies. - What does VFX stand for? - Visual effects. - Oh, okay.
One other use case that I've heard of is autonomous driving. So how is it being used there? - So that's another great question. The autonomous driving industry is booming. They need a lot of data to train cars to perform in all scenarios: weather conditions, environment changes. And most likely we don't actually have a lot of that data.
We can collect data, but we always hit the limits, and this is where synthetic data can help with understanding where the potholes are and what action needs to be taken. And so we can generate this data today. Like I mentioned, we can actually render it and use it as data sets, or we can even generate that visual data using AI, and feed it back into these autonomous agents to see what action they need to take when such a condition occurs. And so it's mostly the lack of data in any industry, especially the autonomous industry. And there are too many variable factors here, like the weather, like I mentioned, road conditions, pedestrians.
What really needs to happen? There's a lot of uncertainty there. And data is gonna help quite heavily in those cases. And this is where we need synthetic data. - I wanna talk about unintended consequences because it's exciting.
(laughs) I mean, when we talk about generating a human where we can't tell the difference between the actual human and the digital twin, or rendering of a human, or the same thing for background environments, or training safety-critical systems on data that isn't real, what kind of concerns might somebody have about something like that? - Every time we build an intelligent system, there are concerns about ethics, bias and responsibility. And this is no different. Our AI systems are highly dependent on the data that they get trained with, along with the model, architecture, etc. So we've seen those problems with non-generative data before, where if your data set is biased, what you get is a biased model, and hence a biased result. And that's no different here. If we generate data using models that are biased, we'll get biased generated data.
And if we use it for safety-critical systems or even anything that has a societal impact, it's going to show its effects there. So there definitely is a responsibility on people who are developing these models to be careful about that. The one silver lining, I would say, is the fact that we do control many variables in how we generate synthetic data, which is different from how we collect data, because sometimes societal bias can creep into where we go to collect the data, etc. Having said that, there are still similar possibilities here depending on who is doing the job of generating these models and building them. So this is no different, in my opinion, from training your normal AI model, but here also, if you're not careful about how you build the generative model that's generating this data, you will get biased results, so one has to be careful about building these models from the ground up in a way that's tested for ethics and bias concerns.
And also, as these develop, these models have adaptability to them because they are generating, so they keep generating more and more data over time, so there is the possibility of a feedback loop, saying, "Hey, if I can detect some bias or something that's not right from a responsibility perspective, can I fine-tune the model and keep making it better over time as I see more and more of these instances?" And, again, I feel the flexibility is there just because it's generated data and we have control over what we can generate and what we can't. We still have to be cognizant of what we're building as we go into using this generated data for more and more applications that are not just movies, that actually affect autonomous cars or the financial industry and so on. - If there's an actor or an artist whose image can be generated, then all of the implications about identity protection and copyrighting your identity, or even your identity at a particular age, or the sort of actions that you would do with that digital identity, come up. - And we've seen some of those examples in the recent past.
I mean, not movie actors, but political players, whose statements were slapped on deepfakes, as an example, and shown as coming from them when they really weren't, and that caused some news as well. These things need to be investigated more carefully. I know that there are lots of approaches that one can take, and there are platform companies looking at certifying media that's being built on their platform. So if I develop something using a platform, I would also assign a certification to it, saying that it has not been tampered with. And that may not be there if somebody actually does switch the face with the voice on a different platform, and then you lose that certificate, and it doesn't propagate with the media anymore. So that's one way to do it.
The other way is to develop tools that can detect such manipulations. Depending on what was used to generate this data, you could actually backtrack and see. But it is a cat-and-mouse game.
As with any security application, people tend to develop newer methods to generate new data and other people try to break that and do something nefarious with it. And it just continues from that perspective. But the problem that you raise is genuine. It has happened. And many researchers are actually looking at different ways to get around that, both from a mitigation perspective as well as a detection perspective.
- You brought up deepfakes, and I'm wondering if you can talk a little bit about the relationship between deepfakes, generative AI, Stable Diffusion, or whatever other kind of generative portion of AI we have, and synthetic data, because one seems to be a particular ability to generate something and then synthetic data seems like a volume vector. So help me understand how those two go together, and how they operate. - Deepfakes are more about detecting whether a specific image or video has been manipulated, whereas if you look at where Stable Diffusion and all this generative AI is heading, it is able to create things beyond our imagination.
And it's able to do that much, much faster. And from an artist's point of view, it gives them a tremendous advantage. They don't really need to start from the ground up, and they can use AI to set the base and then bring their creativity on top of it. In a way, I think synthetic data, or AI helping Stable Diffusion create these imaginary scenes or videos, can be a big advancement in creativity, where artists can put their efforts and their creativity where it's needed most rather than spending their energy on building the baseline of their art. So, in my opinion, generative AI is actually helping the creative community get to the next level. Just like how the move from analog media to digital media made image composition and video editing a lot more interesting for anyone to do, I think, in the future, AI is gonna help anyone create movies or plays much more easily.
It's gonna be in everybody's hands in the coming years. The combination of AI and synthetic data is helping creatives go to the next level. - Maybe if I could add one more sentence or two to that.
If I look at the scope of generative AI, as Selva mentioned, it's quite large. It basically can generate video, text, audio, what have you. And that's the field of generative AI. We have models in there where if I give it text, it'll give me a video back based on the text I give.
So we could use it in movies where we're actually developing these fantastical worlds, and trying to provide content that we couldn't do otherwise. Or we could use it to generate human-like appearances of people that are speaking words that we put in their mouth. Both are applications of generative AI; it just turns out that somebody's using it for the specific purpose of generating deepfakes while others are using it to develop new content for entertainment, or for training these autonomous models, or for finding defects. So, in a way, these are different pillars of the same underlying technology domain.
- Okay, I understand. So you're gonna use synthetic data to make the deepfakes better, to make the generative AI better because it'll have more data to start looking at in order to get you to that next level. That makes sense. - In a way it is generative AI because you're combining two different things that you may already know of, but you're generating a third thing out of those two SCs. So it is in the domain of generative AI.
- Well, we're close now, or we're there now. Just to understand from you, we cannot tell the difference? People can't tell the difference between a human that has been generated from synthetic data and a real human? - It's very, very close. And today, even computer-rendered images or scenes are very close, if not almost identical, to real captures from cameras. Generative AI is taking that to the next level with the mix of rendering plus AI.
Plus we have control over that content. So there are ways we can use it productively, for sure, especially in the media industry. We can change the hairstyle of somebody, or all the things that are needed in the entertainment industry. I think this is gonna catch on. And today they're doing it more manually.
Plus it requires a tremendous amount of compute to do it. I think AI is really helping to bridge that gap, where content creation can be done much faster and with a lot of variety for the creator to pick from. And so they don't have to go after just one style.
They can go after thousands and thousands of styles to see how they want to bring that entertainment module to the audience. - So what are some of the challenges with synthetic data? I mean, we already walked through some of the unintended consequences, but we're looking at how synthetic data is helping us bridge the gap of not having enough real data to train a model. What are the bottlenecks in synthetic data itself right now? - From the training perspective, the topic that you started with, I think there are multiple challenges.
One is the challenge that Selva was mentioning about platform-level performance itself. These are pretty heavy algorithms that need a lot of compute, memory and things thrown at them so that they can run close enough to real time that we can actually get our data when we need it. The second thing is that they are generating synthetic data, but they're also built upon something. And that something is generally exposure to the natural world giving them data about things around them, and then letting the uncertainty within the models go loose and try to generate different combinations and different types of scenarios based on the few things that you provide that model. So from that perspective, we still are limited in the sense that we could do a very good job of generating synthetic data for a very specific use case, whether it's a car driving in California, a robot in a factory, or maybe a movie actor or something like that.
But if we were talking about I wanna build a model that I can then give to a customer, and that customer could just take it to a factory floor, record the video around there, and then start generating data for me, we are not there yet. I think we still are working in very domain-specific silos in different areas. Very impressive. Wherever it's working, it's working really, really well. But the scale out is a challenge, and so is the platform performance.
- How does this intersect with probabilistic research? - A good question. So we've been doing probabilistic research for a few years now. The intersection is pretty natural. If you look at the natural world, it's fraught with uncertainties. Whenever we step foot outside our house, we have a final goal in our mind but we don't know how the world will look around us.
How do we react to it? Those are all things that happen to us: the noise in the environment happens to us, people running in front of our car happens to us. All these things are not something that can be rule-based. These are complete uncertainties in our lives, and that's where probabilistic computing shines because it brings in that element of probability and uncertainty-awareness to AI systems, which actually is very important to make AI systems intelligent because if there is no uncertainty, we don't really need AI. We can just build a rule-based system, and it'll be intelligent. Now, if I go to generative models, and I want to generate data that looks realistic, I've got to have that uncertainty in there, otherwise it's not realistic. So that's where these two intersect, because we're bringing the uncertainty from the real world through probability measurements and probabilistic computing, and we're bringing the power of AI to generate these new worlds and new models. The model takes some things that it observes, uses the uncertainty, and then spews out things that look much more realistic, much closer to the real world.
And if I run the generative model twice, I may get two different results, but both of them will be realistic enough that they will appear to a human as if they might be real data. - So, Omesh, for you, what is the most exciting aspect or terrifying, take your pick, of synthetic data? What do you worry about, or what can you just not wait for? - Just getting there. When natural worlds and digital worlds combine, magic happens.
We can do so many different things. We started with media, but we can actually build digital twins for any usage that you can think of because now you have the control over dreaming of stuff and taking the real stuff, and bringing it together, and working in these bins to experiment with stuff before you actually take it out to the real world, knowing exactly what the consequences would be and all those kinds of things. So those are really exciting.
Also, just looking at it from an AI practitioner's perspective, it's a great litmus test for the claim that AI is truly intelligent. If it can generate data that looks real and can fool a real person, we are almost there. It's really great to get there. So that's very, very exciting.
The same thing is also very terrifying because now, looking at it from the perspective of something that has a life of its own and is generating worlds around us that are real, and we don't have the capability to differentiate between the two, I mean, one could think of so many things that can go wrong. And we have to be very responsible, keep an eye on where we are going, make sure that everything has checks and balances along the way while we get there. But I think that's true with any new technology, and this is no different here, it's just that the stakes seem to be much higher here, but, at the same time, if we do 10% of that in my lifetime, I would be really happy. - I'm just gonna pose a similar question to you, Selva, and I know we were talking about how we have kids the same age.
So what would you tell them, knowing that we have synthetic data out there? - I think the merger of these two worlds to begin with brings the physical world into a virtual world. Plus you have tremendous opportunity to do anything you want in the virtual world. Things we can't do in the real world, we can actually do in the virtual world. And this virtual world is a combination of rendering and AI coming together. The way I see the next generation, whether it is Gen Z or the Gen Alpha coming along, they would see AI as a foundational block for whatever they want to create. I think our current generation, we were using SDKs, modules, libraries and plugins, but I think the next generation is up for a combination of programming models and AI, and to them, AI is just another building block.
And these building blocks really need to be as rock-solid as possible, and to do that, I think synthetic data is required. And so I think we are building the system for the next generation and the foundational blocks that they need. It doesn't matter which field it is, whether it is looking at the real world or creating an artistic world or all of that, we need to build all these foundational AI blocks.
And those blocks are gonna persist. And one other point, following on Omesh's feedback, is that these digital twins or characters that get created, they're immortal. They're gonna stay forever. And so they persist in the world. And so that's another thing for the next generation: they may have things that persist forever, actually.
And so I don't know how that's gonna evolve, but I think we are on the verge of creating those foundational blocks for the next generation to take up. - That's an interesting perspective, that we would need to consider things like expiration dates on things that are generated. Very interesting.
Thank you both so much. We have, again, Selva Panneer and Omesh Tickoo with us from Intel Labs talking about synthetic data. - [Announcer] Never miss an episode of "What That Means with Camille" by following us here on YouTube or search for "InTechnology" wherever you get your podcasts.
- [Announcer] The views and opinions expressed are those of the guests and author, and do not necessarily reflect the official policy or position of Intel Corporation.
2023-02-18