[STEF 2022] 世界を感動で満たす3D-3R技術への挑戦 / Challenge of 3D-3R Technology to fill the world with emotion
Thank you for joining us at the STEF Conference for our session on the challenge of 3D-3R technology to fill the world with emotion. I'm Kazumasa Nomoto, your facilitator today. I'm a Technology Fellow at the Sony Group R&D Center. The 3D in 3D-3R refers to three-dimensional.
3R is an abbreviation of Reality, Real-time, and Remote. Today, we will discuss how the value of contents in Sony's services can be improved by 3D-3R technology, and we will carry out this discussion with our panel of engineers. As I just mentioned, we will touch upon "filling the world with emotion," technology, and Sony's businesses.
As the relationship between these three elements is dynamic, I'll explain it in my segment. Let me start with Sony's purpose. It is to fill the world with emotion through creativity and technology. Today, I'd like to talk about how our technology can contribute to that vision. Next, regarding Sony's business operations: Sony is involved in the gaming, music, and movie sectors, and it is our business to work together with creators to create emotion and deliver that emotion to our customers.
Our entertainment technology and services business creates products and services for creating emotion, such as cameras, camcorders, and so on, and for experiencing emotion, such as TVs and smartphones. Our imaging and sensing solutions business handles CMOS image sensors, which are, as you know, devices that are vital in content creation. These devices are deeply related to emotion.
Going forward, we intend to combine these technologies to provide services in the metaverse and mobility, and to contribute to sustainability. My explanation may still be somewhat abstract, so I'd like you all to watch this Sony Purpose video to understand our vision for our businesses. Thank you for viewing our video. I'm sure the video helped you recognize our involvement in many businesses, including gaming, movies, music, mobility, and sustainability.
And in order for us to increase the emotional value of these services, I believe there are three important factors. One of these is how to improve users' sense of reality through these services. Another is heightening the immersiveness of these services to increase enjoyment.
Finally, improving content interactivity to allow more active engagement. These three factors are the most important, I think. So, how can we improve these three factors with technology? This is all tied in to the 3D-3R that we're discussing today. For example, in order to improve reality, we must transform flat 2D content into more solid 3D content and enhance the reality of the sound and visuals.
It's also important to improve real-time accessibility. For immersion, improvements in sound and visuals are needed to heighten reality, and these need to be experienced in real-time to ensure a sense of actually being there.
As for improving interactivity, this involves better real-time access of remote things as well as improving interactivity from 2D to 3D. I believe that heightening interactivity through three-dimensional interaction provides an outstanding new opportunity. To sum up, the four factors of 3D, reality, remote and real-time are crucial. I thought about how to classify the elements of 3D-3R technology.
One way is by the flow from content creation to customer delivery, and then having the customer experience the content. In the content creation stage, we have real-time simulation. Also, capturing the real world in 3D is gaining a lot of attention. Then there are signal processing, video processing, and audio processing: the technologies needed to bring content to the customer with more realism. Here, reproduction of visual reality and audio reality is extremely important.
Finally, when these are delivered to the customer, experiential technologies such as the latest VR and AR are vital to reality. The end-to-end system is a very important technology in terms of improving 3D-3R seamlessly from content creation to interaction. And I would like to discuss these six technologies in today's panel discussion. Today, we have with us the top engineers who are at the forefront of these six technologies. Mr. Yutaka from Sony Interactive Entertainment is well-versed in real-time simulation in content production. Mr. Tanaka from Sony Group Corporation's R&D Center is in charge of real-world capture.
Moving on to rendering: in charge of visual reality replication is Mr. Nagano, also from the Sony Group's R&D Center. Next, on to sound with our audio reality replication expert, Mr. Asada, also from the R&D Center. Then, our expert in charge of R&D for creating realistic experiences for users through AR and VR, Mr. Mukawa of Sony Semiconductor Solutions.
And lastly, the R&D Center's Mr. Nagahama, in charge of linking these services through end-to-end system technology. Today, I have invited these six experts to have a lively discussion. Thank you for your time. Firstly, let's hear from Mr. Yutaka regarding real-time simulations in content production.
I'm Teiji Yutaka from Sony Interactive Entertainment. I joined Sony in 1988. I was involved with the development of the original PlayStation. And I continued to be involved with it for nearly 30 years. Today, I'd like to speak about real-time simulation from the perspective of the game industry.
From the perspective of interactive entertainment, in other words, the game industry, 3D-3R is without doubt the foundation of gaming. When we create 3D games, 3D-3R underpins everything: reality, remote, and real-time. The games move in real-time. There's a sense of reality. People from remote locations connect to play together. I believe that games are formed from 3D-3R.
First, let me talk about the original PlayStation, the origin of 3D gaming. After I joined Sony in 1988, my boss was Ken Kutaragi, who is known as the father of the PlayStation. I was assigned to his team. I was the eighth member, in fact. From that time onwards, I learned many things from Mr. Kutaragi.
One of the projects I worked on before the PlayStation involved sampling synthesizers, which record raw sounds, transform them into different pitches, and replay them. With devices like these, you can record your voice once and turn it into a choir. They let you play around with sound. Mr. Kutaragi saw these samplers and said, "If you can do this with audio, what about video?" He wanted a way to record and reconstruct the world at will. That concept of digitizing and reconstructing the world at will was what eventually became the PlayStation.
This was how the PlayStation began. There were doubts about whether we could do with video what we could with audio. Eventually, because the world itself is three-dimensional, we began making video in 3D, which led to the birth of PlayStation. PlayStation was the beginning of 3D entertainment and content.
Ever since then, PlayStation has evolved, and currently, PlayStation 5 is on the market. When we analyze the evolution of PlayStation, we've pursued realism consistently since the original. The pursuit started with very crude graphics that eventually became more detailed and capable of finer movements. The pursuit of realism is the history of PlayStation's evolution. What we needed to realize this was real-time simulation technology.
Simulation is essentially calculating movement based on real-world laws. By doing this, as the world revolves around the laws of physics, we can create graphics with correct movements based on physics. So, we pursued this reality through simulation technology.
Calculation is needed to create this simulation technology. And it takes a long time to calculate. From the first PlayStation until the fifth, calculation power has increased and we've been able to pursue realism in graphics further. I would like to talk about two of PlayStation's technologies. One is physics simulation.
Physics simulation is all about creating motion based on the laws of motion, by solving the equations of motion. In other words, realistic motion can be made through physical calculation. For example, when something explodes, hundreds of fragments scatter. It's impossible to recreate this manually, but you can recreate it with physical calculation. To create this motion, we have to calculate each fragment's path.
It takes a lot of computation power, meaning it takes time. To illustrate this, look at the image on the right, with a picture of a ball colliding with a rabbit. If you can't do many calculations, you have to calculate based on the squares surrounding the ball and the rabbit.
These squares are called bounding boxes. Now, if you have more computational power, you can make the squares smaller, which allows for more precise hit detection. For example, if you make the ball smaller, or make it rain, you can calculate what happens when each drop hits the rabbit. The simulation becomes more accurate with computational power, and you can create more detailed movements. Next, let's move on to ray tracing.
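Before turning to ray tracing, here is a minimal sketch of the bounding-box idea in Python. The shapes and numbers are invented for illustration, not taken from any PlayStation code:

```python
# Coarse collision test using axis-aligned bounding boxes (AABBs).
# With limited compute, we test the boxes around objects instead of
# their exact shapes; smaller boxes give more precise results.

def aabb_overlap(a, b):
    """Each box is (min_x, min_y, max_x, max_y). Returns True on overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

ball = (0.0, 0.0, 1.0, 1.0)       # box surrounding the ball
rabbit = (0.5, 0.5, 2.0, 2.0)     # box surrounding the rabbit
far_rabbit = (3.0, 3.0, 4.0, 4.0) # a rabbit too far away to be hit

print(aabb_overlap(ball, rabbit))      # boxes intersect
print(aabb_overlap(ball, far_rabbit))  # no possible collision
```

Only when the cheap box test passes would a real engine spend the compute to check the exact shapes, which is why smaller, tighter boxes mean more accurate simulation.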
This is also done using physical calculations. Ray tracing is a technology for creating visual images. Humans create images with the light that enters their eyes. By tracing and calculating that light, we can create physically accurate images based on the light that would enter the human eye. As you can see in the diagram to the bottom left, each ray of light that enters the eyes, is a pixel.
We can calculate where each one leads. But there's not just one ray of light. As you can see on the right, light emanates from a source, bounces off objects, and disperses: one ray becomes several hundred or thousand. The light is reflected many times on its way to the human eye, so in practice the number of rays is effectively infinite.
It takes a lot of innovation to calculate effectively infinite light rays within real-time limits. We sample first, then calculate the important rays. If you don't calculate enough rays, you get noisy images. The phenomenon is the same as when you take a photo at night and the photograph comes out noisy. So you need to find ways to calculate a lot of light.
In other words, you have to find ways to increase computational power. I'd now like to show you a video of us experimenting. This is how we do physical calculations and light calculations simultaneously in real-time.
On the bottom left are images of the real world from a camera. The large images in the background are images made by computer graphics. Let's play it. As you can see, the computer graphics move in tandem with the actual footage.
And above these, you can see the physical calculations being made. In this way, you can observe behavior, like collisions, taking place. These are all made possible by physical calculations.
Here, we're applying whatever texture we want. The texture can be metallic or rubber; it's up to you. You can simulate how these change the way the object reflects light. This is one technology we've created. And to finish up, I'd like to consider the future of simulation.
As I mentioned before, running simulations on everything is both costly and time-consuming. On top of that, we have to run the simulation in real-time. So, I've been thinking about a fusion between simulations and the trend of the moment, AI. Of course, we run simulations when we can.
But when it's difficult or time-consuming to express through simulation, we can use AI inferences, or extrapolations. And by merging simulations with AI, we can create graphics efficiently. For example, we're working on this with Mr. Nagano. He'll talk more on this later. In ray tracing, if the sample size is too small, you get a noisy image. But thanks to AI technology, that noise can be removed, successfully recreating the original beauty of the image.
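A minimal sketch of that denoising principle, assuming a simple averaging filter as a stand-in for the trained AI model (real systems use neural networks far more sophisticated than this):

```python
# Stand-in for an AI denoiser: a simple 1D moving-average filter.
# The principle shown is only that neighboring-pixel information can
# cancel per-pixel sampling noise; a learned model does this while
# also preserving edges and fine detail.
import random

random.seed(1)
clean = [1.0 if 20 <= i < 40 else 0.0 for i in range(60)]  # "true" image row
noisy = [v + random.uniform(-0.3, 0.3) for v in clean]     # few-sample render

def denoise(row, radius=2):
    out = []
    for i in range(len(row)):
        window = row[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

print("noisy error:   ", round(mean_abs_error(noisy, clean), 4))
print("filtered error:", round(mean_abs_error(denoise(noisy), clean), 4))
```

The filter trades a little edge sharpness for much lower noise; the appeal of AI denoisers is that they recover the noise reduction without that blurring.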
So, we can fuse the technologies of AI and simulation to move images in real-time. This is what I'm working on now with Mr. Nagano. And that's it for my presentation.
Thank you so much, Mr. Yutaka. My background is in physics, actually. I always thought of physics simulations as simulations for the real world. I'm excited to hear that this is going on in the virtual world. It's like I'm back in university. And in real-time, no less.
Thank you. I'd like to throw the discussion open and maybe take any questions. Or perhaps I should nominate someone to speak.
How about Mr. Tanaka? You were laughing. The subject of this panel is 3D-3R, which involves the 3Rs. You talked in some detail about reality and real-time, but you only really referred to remote in passing. How does the remote element come into the world of PlayStation games? Well, the original PlayStation came out before broadband, before the Internet was widespread. So, at the time, we pursued the elements of real-time and reality. By the time PlayStation 3 was released, broadband was widespread.
So the PS3 had a broadband connector built into it from the beginning. The remote element really changed gaming from the ground up. Gaming is basically digital entertainment, so it's easier to make remote.
You can filter everything through the network in bits. The first thing that comes to mind is competing with people in remote locations. We have the technology to ensure competition in real-time without delays, meaning that many people can play together remotely. Also, thanks to digital networks, we can do content distribution and use remote in other ways.
We constantly think about new functions that incorporate the remote element. Let's go for another question. How about you, Mr. Nagano? You were just mentioned. Just now, Mr. Yutaka explained that the original PlayStation was able to run 3D CGI in real-time.
In retrospect, that is pretty amazing. How were you able to achieve that? At the time, even personal computers weren't able to deal with moving images. After all, it was 1994. And at the time, 3D CGI workstations cost around 20 million yen. Silicon Graphics released one.
It was all about developing one at a reasonable price for the home. The linchpin was semiconductors. By packing the functions necessary for 3D calculation into semiconductors, we brought the price down.
It's no exaggeration to say that we were able to bring PlayStation to market thanks to semiconductor technology. I see. Thank you. I'd love to continue this discussion, but let's carry on.
Next, let's hear about real world 3D capture from the R&D Center's Mr. Tanaka. I'm Junichi Tanaka from the R&D Center. As we stated, our purpose is to fill the world with emotion. So I'd like to give a simple overview on the subject of capturing the real world. It just so happens that I work in the area of spatial image technology that Mr. Yutaka brought up earlier.
And what motivates me are the words of Mr. Kutaragi, said to be the father of PlayStation. He spoke about the "digitization of the world." Mr. Yutaka just mentioned this and I really like this phrase. His words moved me and motivated me to digitize the world. And what I heard at the time was that PlayStation was made to digitize the real world, not just to create a virtual world. That's what 3D CGI is all about.
And what Mr. Kutaragi stressed was the importance of interaction in real-time, not a simple record of it. At the same time, it's not just about interaction. It's about the value of the data exchanged in digital space. The fact that intangible things could have value impressed me greatly.
This was the same concept as what we now call the metaverse. I was really impressed by this and in order to digitize the world, I was drawn to capturing space. It is still a main theme of my research. You just mentioned the time when PlayStation first came out. CGI was still being created from scratch, and they were very low-poly. But more recently, CGI has approached the way the real world looks. But this isn't based on one particular technology.
We're able to do this thanks to the evolution of many technologies. One big factor is sensing devices, such as depth sensors, which have come quite a long way. Another factor is photogrammetry, which creates 3D models from photographs. The evolution of signal processing technology helped too. Then there's the emergence of cloud computing.
This made computational power skyrocket. Up to this point, we were just collecting tools. But now, we have the emergence of AI.
Thanks to AI, we can optimize beyond human intelligence. In the past, we only managed to construct models through this technology. But now, we can optimize them fully. So 3D processing has moved from an almost academic research level to an industrial field integrated into society, thanks to a combination of these four technologies. When it comes to research, there are many kinds of 3D research being conducted at Sony. And my specialty is to totally capture the real world.
And it's not just about capturing. There's capturing, displaying, and interaction. All of these three need to be integrated. And it's these three elements that I concentrate on in my research.
As we don't have much time, I'll focus on capture today. So, what kind of capturing are we doing at Sony? Broadly speaking, there's the capture of humans, and there's the capture of objects. I'm pursuing two approaches to capturing humans. One is known as volumetric capture, and the other is digital humans. These are actually contrasting methods.
Volumetric capture records a subject as 3D data over time, like video. It's also called performance capture or 4D capture, since it captures the dimension of time. It films things the way they are. The main benefit is photorealism.
You can capture things like the movement of clothes. On the other hand, with the digital human method often used in games, you are able to accurately and delicately create a 3D body, along with hair and clothes, and use physical simulations. And as Mr. Yutaka just mentioned, graphics are created using both simulation and animation. These two methods are polar opposites, but each has its pros and cons. So I use a combination in my research, as having both would differentiate us.
Regarding the capture of sets, there's Sony Innovation Studios at Sony Pictures. They have a volumetric capture system there that captures over a wide range. And they don't only conduct capture: they do what's called virtual production, where they film with the backdrop shown on an LED screen. The backdrop is rendered with correct parallax for the camera, allowing high-quality capture.
This is one way Sony is bringing academic technology into industry. And we're at the level that we can use it in the process of creation. And in terms of the 3R, our research and development is focused on the areas of reality and real-time. Sony Pictures and others use this technology in order to support creators in film production. And I want to continue to provide this through the development of technology. This was just a very simple explanation. But I'll stop here.
Thank you, Mr. Tanaka. The fact that you can capture real-world objects, people, and environments makes me feel like I can be cloned. It's very thrilling. Thank you. I'd like to move on to discussion here. And I'd like to nominate someone. How about you, Mr. Asada?
Listening to you talk about 3D production, I get the impression things will change. How, exactly, will it change? Well, the most important thing about capturing something in 3D is that you can use what you've captured as assets. Of course you can use footage shot in 2D as assets as well.
After all, we have video stock currently. But placing objects in different positions and viewing them from different angles was always quite difficult. If you capture them in 3D, you can capture them at their real size. And by making them into assets, you can freely position, utilize, and combine them. In the gaming world, the concept of assets is generally recognized. However, the concept of using real-world captures as assets has yet to catch on.
But as this is implemented, the images that you've captured can be used in movies. The concept of uploading your footage and turning it into an asset for other people to use is an extremely important point. Thank you. I see. We have time for another question. Mr. Nagahama.
You talked about creating backgrounds and sets through volumetric capture. I think that's what you mean when you say assets. But you also mentioned digital humans.
As Mr. Nomoto just said, he felt like he could be cloned. What will be most important in making assets of humans as 3D data? Good question. We actually have not achieved that as of yet. We have of course created assets of people in games or as characters. However, we have only been able to capture light and the figure itself.
To use this asset, we must make it move and express itself in a certain way. This requires movement and acting, which is extremely difficult to achieve now.
Going forward, generating autonomous movement will become very important. For natural movement, we're starting with throwing a ball and similar actions. We're trying to capture exactly how each individual throws the ball. I think that this individuality will be very important in creating digital humans.
Being able to tell the person from the movement. Really interesting. Thank you. We have had Mr. Yutaka and Mr. Tanaka talk about content creation technology. Next, we will talk about how to deliver this content, starting with Mr. Nagano, on the subject of video signal processing. I'm Takahiro Nagano from the R&D center. I develop technology to reproduce high quality video.
One such technology is super-resolution imaging. This will be used in all video content for Sony equipment. For example, if you had a video with the quality on the left, it would automatically be converted to high definition like on the right.
These are the initiatives that have used this technology so far. Sony was the first company to introduce AI-based super-resolution in products. It was used in 4K and 8K TVs and projectors; we had super-resolution zoom for cameras and smartphones, and even 4K endoscopes. These products are mostly for 2D content. Until now, we recreated this 2D content in high definition. However, in the future, as Mr. Tanaka stated before,
Sony's business is spreading from electronics toward entertainment. For example, animation production or film made with virtual production, games, XR, etc. A wide range of genres. Today, our content is shifting from 2D to 3D. We are developing technologies to create 3D content in high definition in real-time. I would like to introduce one core technology for this.
Mr. Yutaka also explained this earlier. High-quality ray tracing for 3D content requires enormous computing power. You must shrink computation significantly to create this in real-time.
We are developing technology to complement this by signal processing to reduce rendering time and improve quality. There are two specific methods. One is to compress the size of the image. By processing the compressed image and rescaling it with super-resolution, you can recreate high definition images. The other is the method Mr. Yutaka already spoke of.
By reducing the number of rays traced, you create a very rough and noisy image. By restoring this with AI-based signal processing, you create a beautiful image. Let's look at an example of this. This is part of a collaboration with Sony Pictures Entertainment. It is an animation called "Hotel Transylvania 4."
This is a test using video from this film. The left side has had nothing done to it. It takes a lot of time to make content when you do full ray tracing every time, so we used simplified computing to reduce this time. However, this leads to lower quality, which results in the left image.
By using the super-resolution technology, you can transform it as on the right. If you look closely, you can see the fine details, such as the outlines, recreated. This is the result of processing another video. You can see that the detail in the old man's hair is recreated well in the right image. Now let's take a look at this processing on a more photorealistic image. The left had reduced ray tracing, so it seems to have lost its texture.
The right image is after we processed the image. If you watch the video, you can see the shine of metal and the clarity of glass are recreated very well. These examples show what is possible. We will continue to contribute to consumer electronics like TVs and cameras.
But we will also spread to new entertainment fields such as games, film production, XR, and live streaming. I would like to develop technology to create beautiful content in real-time. That is all. Thank you, Mr. Nagano.
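As a toy illustration of the first method Mr. Nagano described, the process-small-then-upscale pipeline can be sketched as follows. The nearest-neighbor upscale here is a crude stand-in for the learned super-resolution model, shown only to make the data flow concrete:

```python
# Toy sketch of the "process at reduced resolution, then upscale"
# pipeline. Real products use trained super-resolution networks; a
# plain nearest-neighbor upscale stands in for the learned model.

def downsample(img, factor):
    """Average-pool a 2D grid (list of lists of floats) by `factor`."""
    out = []
    for y in range(0, len(img), factor):
        row = []
        for x in range(0, len(img[0]), factor):
            block = [img[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def upscale(img, factor):
    """Nearest-neighbor upscale (a learned model would go here)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out

hi = [[float((x + y) % 2) for x in range(8)] for y in range(8)]  # 8x8 source
lo = downsample(hi, 2)        # cheap to process: 4x fewer pixels
restored = upscale(lo, 2)     # back to full resolution
print(len(lo), len(lo[0]))              # 4 4
print(len(restored), len(restored[0]))  # 8 8
```

The compute saving comes from doing the expensive work (rendering or ray tracing) on the small grid; the quality then depends entirely on how well the upscaling model restores detail.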
The shininess in the video was amazing. The power of this signal processing. Image processing is Sony's specialty. I see how this will continue to move forward into the 3D world. Thank you. Let's move on to the discussion. Mr. Mukawa hasn't spoken yet. Please go ahead. Mr. Nagano brought up the topic of virtual production.
With COVID-19, virtual production will become more important for film production. I want to ask how high-resolution imaging would be applied to virtual production. I think there are two big ways.
One is that teams must create background assets beforehand in virtual production. They must be made in 3D. You do 3D modeling from many 2D images.
There can be issues of poor resolution, lots of noise, and missing images. Our imaging technology can be used in these cases. We are aiming for this process to be almost fully automated.
We can project a 3D image produced in this way and shoot with the actors performing in front of it. In this case, the background must be adjusted based on camera position. You really need a high quality background for this. We can use ray tracing acceleration technology as I talked about before. We think it is possible to create a real background with shiny and clear objects. I see. Thank you.
Going around, let's hear from Mr. Yutaka. Quality is really important when transmitting content end-to-end, but people in the gaming or interactive world are also interested in latency. What do you think about that? All processes from production, transmission, to viewing may cause some latency. We are also developing a video codec technology for this. We are trying to compress video when transmitting 3D images.
We are also developing technology to reduce latency. If you pursue low latency, the quality of the codec itself drops. By combining this with super-resolution technology, we feel we can recreate an image without damaging the quality.
By combining all of this, we can achieve high quality, highly compressed, and low latency video. We can have it all. That's right. Amazing.
Thank you. Thank you. Mr. Nagahama will talk about end-to-end latency at the end. Next up is Sony's specialty of specialties. Mr. Asada from the R&D Center will discuss reality in sound processing.
I am Kohei Asada, in charge of audio technology at the R&D center. Focusing on reality, I think spatial perception is most important. In other words, awareness of the world by our visual and audio senses is key. Visuals tend to be put first, and in fact all the talk has been on video so far. The role of sight and hearing are different and they complement one another. For example, the eye is limited to what is before it.
We can actively look at or avoid looking at something. The ears always hear in all directions. You cannot close them, so you are always hearing. Eyes have superior spatial resolution, but ears have higher time resolution. You can see this situation in the lower part of the slide.
Have you ever heard a siren while driving, but not known where it is, and you had to actively move your head before you found it? This is one survival mechanism that humans and animals acquired. In particular, when you can't see over a wall, or in the dark, hearing is more important than sight to understanding the situation. You can tell if someone is on the other side or is talking. Or if you go into a tunnel, you can guess the size of the space or the material of the walls from the echo.
The information you gain from sound is key to creating reality. This slide shows the lineage of Sony's audio products. Sony has a long history in audio technology; it is our tradition. The motivation for this evolution was the pursuit of reality. In particular, we tried to recreate the sound of concerts.
Sony has a history of recording and mixing to recreate real sound. It moved from monaural to stereo, and then to surround sound in home theaters. In 2019, we used object-based technology to launch 360 Reality Audio, which can position sound spherically.
Object audio that didn't rely on speaker position enhanced the spatial sensation. I would like to analyze the elements of reality in sound. This image is split into three stages. First is sound source. What size or directionality will each sound produce? Then spatial propagation. How does sound cross the space to reverberate?
Lastly is listening. How does sound enter the two eardrums? It is easy to understand by breaking it down like this. These are the physics. For reality, we need to consider how humans perceive and interpret sound. For example, information on the source matters the most in terms of sensing changes or communicating. Spatial propagation gives us spatial information, like the size of the room or the material of the walls. In listening, we perceive where the sound source is, and determine the next action, such as talking or avoiding danger.
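As a toy illustration of the listening stage, here is the classic Woodworth approximation of the interaural time difference, one cue the brain uses to localize a sound source. The head radius and speed of sound are rough textbook values, not figures from the talk:

```python
# Toy model of the "listening" stage: direction is perceived partly
# from the interaural time difference (ITD), i.e. how much earlier a
# sound reaches one ear than the other. Woodworth's classic
# approximation: ITD = (r / c) * (theta + sin(theta)).
import math

HEAD_RADIUS = 0.0875    # meters, approximate adult head
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def itd_seconds(azimuth_deg):
    """ITD for a distant source at the given azimuth
    (0 = straight ahead, 90 = directly to one side)."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

for az in (0, 30, 90):
    print(az, "degrees:", round(itd_seconds(az) * 1e6, 1), "microseconds")
```

A source straight ahead gives zero ITD, and a source at the side gives roughly 600 to 700 microseconds; reproducing cues this small is part of why binaural rendering demands high timing precision.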
So in pursuit of reality, we need to consider how humans perceive sound, and add relevant information at each stage. This blue portion is where we expect to evolve going forward. I think we need more advancements in sound source and spatial propagation. There is not enough reality in reproducing the direction or size of a sound source. We are replicating spatial propagation through very simple models and the time and effort of creators. For video, high-precision simulation-based rendering in real-time is advancing.
Audio also needs to reach the same level. In addition to simulation technology, we can take a big hint from human cognition and memory, as shown on the right. These are all of the thoughts I have for you today. I'll take any questions you have. Thank you. Audio technology...
The major advancements have been in video quality, but object audio seems to be an inflection point in its own right. I can see that quantifying perception seems to be very difficult. But we seem to have more knowledge on how audio affects the mind than video.
It's not easy, though. Let's move on to the discussion. Mr. Tanaka, please go ahead. Reality and real-time were key words here.
Are you making any efforts on the remote side? You're bringing up remote quite a bit, Mr. Tanaka. Yes, as you can see in this slide, I have examples of technology on the remote side. We have the 360VME (virtual mixing environment). This uses 3D audio technology for sound production speakers.
It measures the sound environment, the room, and individual hearing traits, and it is able to recreate this sound in headphones. This technology focuses on precision in measurement, calculation, and recreation. People can't tell if a sound is coming from the speakers or the headphones.
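A minimal sketch of the core signal operation behind this kind of speaker virtualization: convolving the dry signal with a measured impulse response. The impulse response below is invented for illustration; a real system like 360VME measures one per room, per speaker, and per listener:

```python
# Simplified sketch of speaker virtualization: convolve a dry signal
# with an impulse response so the headphone output carries the room's
# (and ear's) acoustic signature. The impulse response here is a made-up
# toy: a direct sound followed by two quieter, delayed echoes.

def convolve(signal, impulse_response):
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

dry = [1.0, 0.0, 0.0, 0.0, 0.5]  # dry source signal (two clicks)
room_ir = [1.0, 0.0, 0.3, 0.1]   # direct sound plus two echoes
wet = convolve(dry, room_ir)
print(wet)  # each click now trails scaled, delayed copies of itself
```

Production systems do this per ear with long measured responses and fast FFT-based convolution, but the effect is the same: the headphones reproduce what the room and speakers would have done to the sound.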
Take a look at this picture. In Sony Pictures Entertainment, there is a mixing stage the size of a theater. The sound creator mixes sound here. They can adjust the position and tone of sounds coming from all directions. This stage was closed during the COVID-19 lockdowns.
At that time, we worked with SPE to use 360VME to recreate the mixing stage in the homes of sound creators. Several SPE films last year actually used this for sound production. Looking back, reality in sound and real-time processing were used remotely, demonstrating how the 3R contributed to business. That's where we're going with that.
Let's take another question. Mr. Nagano, you emphasized the concepts of perception and memory in your talk. Could you explain those a little more? Accurate simulation is part of this pursuit of reality.
But I think reality involves more than just that. As an expert on vision, I'm sure you've heard the term memory color. There are also processes that focus on cognition in TV. I think there is also memory sound in the audio field. Even if we can reproduce sounds accurately in simulations, there may be something off or lacking in that sound.
In film, games, and music, there are sound creators for each field. It is important to creatively harmonize sound and express a sense of reality that fits the worlds these creators make. As audio is invisible, it tends to have an unconscious effect on emotion. So I think there is a lot of know-how involved in recreating reality. We should continue to learn about this as engineers. I want the Sony Group to collaborate more with entertainment sound creators.
The plan is to use this as the core for future technology. Thank you. Thank you. I also love audio, and I have a specific audio setup in my home. The 360VME was really amazing. It was honestly the most overwhelming audio experience of my life.
Thank you. It was so realistic. I wish I had that in my home. Personally.
Thank you. Next is Mr. Mukawa. He will discuss virtual reality and augmented reality, new trends that many users are already experiencing in entertainment. I am Hiroshi Mukawa from Sony Semiconductor Solutions. For 20 years, I have worked on developing AR/MR head-mounted displays. I would like to discuss AR/VR as technology that can deliver realistic experiences.
First, I will introduce the AR/VR products we produced at Sony. We released the Glasstron in 1996, the first consumer head-mounted display. Since then, we developed various head-mounted displays for AR/VR.
On top, we launched PlayStation VR in 2016. The PlayStation VR 2 is coming to market next year. AR is on the bottom. We launched movie subtitle glasses in 2012. In 2015, we launched the SmartEyeglass. In 2019, we prototyped an AR head-mounted display with a spatial tracking function. This was used for entertainment content and consumer demos.
AR/VR technology delivers experiences that cross the real and virtual worlds. AR layers digital content over the real world to give a sense of presence and reality to the content. VR places the user inside a virtual world that is immersive.
The key is how to heighten reality and immersiveness. Using reality/real-time technology, we can trick the five senses. The key senses are sight, touch, and hearing. As Mr. Asada explained audio, I will explain the visual aspect. There are two requirements to heighten immersiveness: reality and real-time. Reality includes factors such as 3D display, wide FOV, and high resolution.
For real-time, world locking is an extremely important point. World locking means that when content is shown on a head-mounted display, the content stays still within the space even if the person moves their head. This video demonstrates this concept. This shows a white cube displayed on an optical see-through AR HMD. The see-through background is also projected.
If the person's head moves, the white cube will move along with it against the backdrop. Let's add latency assurance technology. Now, the cube is firmly attached to the background. This is an important factor in realism. There are still a number of issues in AR/VR displays.
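The latency-compensation idea behind world locking can be illustrated as a simplified "late reprojection" step: just before display, the image is corrected for head motion that occurred after rendering. This one-axis sketch is an illustration of the concept, not Sony's actual pipeline.

```python
import numpy as np

def yaw_rotation(yaw_rad):
    """2D rotation about the vertical axis (head yaw only, for simplicity)."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    return np.array([[c, -s], [s, c]])

def late_reproject(point_xz, yaw_at_render, yaw_at_display):
    """Shift a rendered point by the head motion that happened after
    rendering, so content stays locked to the world, not to the head."""
    delta = yaw_at_display - yaw_at_render
    # Counter-rotate the scene by the head's extra rotation.
    return yaw_rotation(-delta) @ point_xz

# A cube corner rendered when the head was at 0 rad; by display time the
# head has turned 0.1 rad, so the image is counter-rotated by 0.1 rad.
corrected = late_reproject(np.array([1.0, 0.0]), 0.0, 0.1)
```

A real system predicts the display-time pose from IMU data and applies the correction as a 2D warp in the final milliseconds before scan-out.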
Let me talk about one challenge to creating realism and immersion. This is about human vision. Generally, humans can perceive color in a 120 degree range horizontally.
People with average eyesight can resolve about 60 monochrome pixels per degree. Covering 120 degrees horizontally and 80 degrees vertically, an HMD would need roughly 35 megapixels per eye, or about 70 megapixels in total. This is twice the resolution of 8K video. As I said earlier, world locking is also very important.
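As a rough check of these figures (assuming the roughly 70 megapixels refers to both eyes combined):

```python
# Back-of-envelope check of the pixel count quoted above.
pixels_per_degree = 60            # ~average visual acuity (1 arcmin)
h_px = 120 * pixels_per_degree    # 120 deg horizontal field of view
v_px = 80 * pixels_per_degree     # 80 deg vertical field of view

per_eye = h_px * v_px             # ~34.6 megapixels per eye
both_eyes = 2 * per_eye           # ~69 megapixels total

eight_k = 7680 * 4320             # one 8K frame, ~33.2 megapixels
```

The total comes out to roughly twice an 8K frame, matching the comparison in the talk.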
This is hardest for optical see-through AR, where you can only allow a latency of a few milliseconds. No mobile semiconductor will be able to do this in the near future. So what do we do? We use what we know about human perception and try to reduce the number of pixels we render.
Human vision is very strong around the central line of sight. Farther out, it begins to decline quickly. So by only drawing objects in the user's line of sight in high definition, while the periphery is in low resolution, we can greatly reduce the cost of imaging. This requires eye tracking technology and technology for reducing data.
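The idea of concentrating resolution at the gaze point (foveated rendering) can be sketched with a few falloff tiers. The angles and rates below are illustrative assumptions, not measured values.

```python
import numpy as np

def shading_rate(angle_from_gaze_deg):
    """Fraction of full resolution rendered at a given angular
    distance from the gaze point. Tiers are illustrative."""
    if angle_from_gaze_deg < 5:      # fovea: full resolution
        return 1.0
    elif angle_from_gaze_deg < 20:   # near periphery: quarter resolution
        return 0.25
    else:                            # far periphery: 1/16 resolution
        return 1.0 / 16.0

def relative_render_cost(fov_deg=120, samples=1000):
    """Average shading work vs. full resolution, assuming gaze at the
    center and averaging over a 1D slice of the field of view."""
    angles = np.linspace(-fov_deg / 2, fov_deg / 2, samples)
    return float(np.mean([shading_rate(abs(a)) for a in angles]))

cost = relative_render_cost()  # well under a quarter of full-res work
```

This is why eye tracking is a prerequisite: the high-resolution region must follow the gaze with low latency, or the resolution falloff becomes visible.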
Also, low-latency imaging technology is required. This is a process flow chart for AR/VR systems. First comes sensing, then recognition, which lets the system understand the space. Based on this understanding, the application decides on a response. Based on that response, it renders the data output as display, audio, and haptics.
I explained this loop based on vision. However, in order to enhance the immersiveness of AR/VR systems, we must conduct this loop quickly and accurately with few calculations. It is important to dive into human senses and perception, not just technology. We are focusing on these efforts.
That is all from me. Thanks, Mr. Mukawa. You are Mr. AR at Sony, since you've led the development for so long. As Mr. Asada also said, all paths seem to lead to cognition. That's common.
Yes. Next, let's move to the discussion. Mr. Yutaka, I assume you have a question related to AR and VR? One topic for AR/VR is the comfort of the wearable. In gaming, you can be immersed for hours.
Is it possible to create a wearable that is more natural for AR/VR? That is a very pressing issue. There are two aspects to comfort with wearables. One is comfort in wearing. The other is comfort in viewing the content.
Comfort in wearing is linked to the weight of the device. Displays have been heavy, but we are working to make them lighter and smaller. The core issues are the SoC, imager, and the light source of the display.
They emit heat, so cooling will be a factor in determining weight and size. It might seem obvious, but lowering power consumption would be the most important factor in reducing weight. There is also comfort in viewing. The issue here is the mismatch between vergence and focus. Simply put, this concerns depth perception when the eyes focus. Since we have binocular vision, the eyes converge at an angle on closer objects and stay nearly parallel on distant ones. If this depth cue is inconsistent, it results in discomfort.
There are many optical solutions for this. For instance, dynamically adjusting a liquid-crystal lens so the focus is consistent with the depth of the displayed content. Or circumventing the eye's focusing altogether by drawing the image directly on the retina with fine optical beams. There are several such efforts, but it will take some time to commercialize them.
This is a very serious issue we and others are trying to tackle. Right. The key is how to present that in a natural way. Right. Thank you. We have one more discussion.
Mr. Nagahama, you'll be touching on the metaverse in your part, but as this is related to AR/VR, do you have any questions? This is linked to Mr. Yutaka's question. I think making devices compact is key to reducing the burden of wearables. You spoke about making displays smaller. How do you think efforts to make displays smaller will proceed? It's hard to explain in words, so I will use slides here. AR is on top and VR on bottom. Now, AR uses glass waveguides
and projectors with reflective LCD panels. Next, projectors will be made more compact using very bright self-emissive panels called micro-LEDs. This device makes the projector smaller. Right now, the glass is completely flat. Instead, we will use curved plastic, as with sunglasses. This is a more natural style.
VR originally used LCD panels and Fresnel lenses. Recently, we've used OLED and lenses called pancake optics. Using these lenses helps to make displays thinner. Pancake lenses are still ordinary refractive lenses.
This is a lens called Flat Optics, which uses an LCD. This is thinner and even lighter. This will become common. There is ongoing development for AR and VR to ultimately embed displays in contact lenses. That seems like a fantasy. But in the US, there is actually a prototype that the FDA has approved.
Testing has begun. It will take at least ten years. But we might see displays in contact lenses with very large field of view leading the field of AR and VR, and I am excited for that. With contact lenses, anyone can enjoy that entertainment. It is far in the future. I can't wait. Thank you.
Thank you. Let's move to the next topic. Mr. Nagahama. Sorry to keep you waiting. Please tell us about the end-to-end systems that link capture to experience. I am Hiroki Nagahama of the R&D Center. I would like to give a general overview of the 3D-3R end-to-end system.
As Mr. Nomoto spoke about, you all discussed the cycle from capture to processing to final display. This image shows the end-to-end system. First, it captures the real world. The real physical world. The key word is digitizing the physical space. We can then process the 3D data of this digitization.
Finally, this processed data is returned to a realistic format humans can view. From real to virtual to real. This is the overall system. As you spoke about, the essence of 3D-3R is using 3D data to pursue reality. This requires a lot of computing power. Remote is part of the 3R. Networks are needed to connect the captured content and the viewed content.
Our challenge is how to use computing and networks to realize this system. The performance of networks and computing is steadily rising. So, how does this affect 3D-3R applications? Right now, the demands of 3D-3R applications are much higher than current networks and computing can meet. The graph on the left shows that 3D, or volumetric, video carries about 1,000 times as much data as 2D video. The graph on the right shows the computing, or processing, volume: producing volumetric 3D video requires far more computing performance than 2D video.
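A back-of-envelope comparison shows how a raw volumetric representation can open a gap of this order. The representation and sizes below are assumptions for illustration, not measured Sony figures.

```python
# Raw, uncompressed data per frame (illustrative assumptions).
bytes_per_pixel = 3                       # 8-bit RGB
frame_2d = 1920 * 1080 * bytes_per_pixel  # one HD frame, ~6.2 MB

voxels = 1024 ** 3                        # a dense volumetric grid
bytes_per_voxel = 4                       # color + occupancy
frame_3d = voxels * bytes_per_voxel       # ~4.3 GB per frame

ratio = frame_3d / frame_2d               # on the order of several hundred
```

Compression and sparser representations narrow this gap, which is why the codec and data-reduction techniques mentioned here matter so much.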
As Mr. Nagano stated, there are many techniques. You can use codecs to compress and other ways to reduce this gap. But consider the metaverse.
The metaverse, essentially, is a virtual space with multiple users sharing a 3D experience. It is extremely challenging in terms of networks and computing. We have the individual technologies, but the key is creating the system architecture.
As a concrete example, I will explain using a metaverse or multiplayer game. On the left is a multiplayer game, a typical example. Multiplayer games are played over networks.
There is a game server in the cloud, as well as client devices, either a game console, smartphone, or HMD. The application runs on this client device. For example, data is sent to a server if the player moves a character. This data is gathered and sent to other client devices to synchronize.
This server-client model is standard. As the application runs on this client device, it has the advantage of low latency of inputs. On the other hand, if updates are needed, the client must update the software to continue the experience. The client device makes a difference, as well.
The PlayStation is very powerful, but smartphones have yet to catch up. There is the disadvantage of limited graphics quality due to this. Looking at the right, we have an example using cloud gaming technology. Here, the client device acts as a thin client. The controls input on the client device are sent to the cloud server. The application runs in the cloud, which produces the video and audio.
This content is returned to the client device by a network. In this case, the application is run in the cloud. It can update without the user knowing. It can evolve more steadily. Another advantage is that it can use cloud computing to reproduce beautiful graphics.
It is also possible for quality graphics beyond client device performance. But using the network this way, there is the issue of latency. The issue is that the reaction to inputs will inevitably be delayed. And while we do have game servers running in the cloud today, running the game client in the cloud would involve much more cost.
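The standard server-client model described above can be sketched roughly as follows. Class names and message shapes are illustrative, not an actual PlayStation Network protocol.

```python
# Minimal sketch of an authoritative game server with thin state sync.

class GameServer:
    def __init__(self):
        self.positions = {}  # player_id -> (x, y)

    def apply_input(self, player_id, dx, dy):
        """A client reports a movement input; the server updates shared state."""
        x, y = self.positions.get(player_id, (0.0, 0.0))
        self.positions[player_id] = (x + dx, y + dy)

    def snapshot(self):
        """State broadcast to every client so all worlds stay in sync."""
        return dict(self.positions)

class GameClient:
    def __init__(self, player_id, server):
        self.player_id = player_id
        self.server = server
        self.world = {}

    def move(self, dx, dy):
        # The client applies the move locally for low input latency
        # (omitted here) and also reports it to the server.
        self.server.apply_input(self.player_id, dx, dy)

    def receive_snapshot(self):
        self.world = self.server.snapshot()

server = GameServer()
p1, p2 = GameClient("p1", server), GameClient("p2", server)
p1.move(1.0, 0.0)
p2.move(0.0, 2.0)
p1.receive_snapshot()  # p1 now sees p2's position as well
```

In this sketch the "network" is just a method call; in practice the same flow runs over low-latency transports, with client-side prediction and reconciliation hiding the round trip.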
Each style has its pros and cons. It is important to select the optimal method based on these factors. I would like to give two examples of future development. One is recreating a soccer match in a virtual space. We placed an actual camera in a stadium and tracked player and ball movement. It captured just the poses of players during movement and this data was sent to client devices.
This was reproduced as CGI with 3D models on the client device. Volumetric data is very large, so we limit this by using only the pose data. We reproduced this in CGI to make this 3D experience. This was a joint venture with Manchester City Football Club in the Premier League. We use this technology to test what kinds of experiences are possible. The other example is mobility, as was already brought up, an example of remote driving in remote areas.
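The pose-only streaming idea behind the soccer example can be sketched like this. The joint count and message format are assumptions for illustration, not the actual system.

```python
import json

NUM_JOINTS = 22        # skeletal joints tracked per player (assumed)
FLOATS_PER_JOINT = 3   # x, y, z

def encode_frame(players):
    """players: {player_id: [[x, y, z], ... per joint]} -> wire message."""
    return json.dumps(players).encode("utf-8")

def decode_frame(message):
    """Client side: recover poses to drive the 3D character models."""
    return json.loads(message.decode("utf-8"))

# One player's pose for a single frame (dummy coordinates).
pose = {"player_7": [[0.0, 1.0, 0.0]] * NUM_JOINTS}
wire = encode_frame(pose)

# Pose data per player per frame is tiny: 66 floats, a few hundred
# bytes raw, versus megabytes or more for volumetric capture.
floats = NUM_JOINTS * FLOATS_PER_JOINT
```

The heavy lifting (rendering the players as CG from these poses) then happens on the client, which is exactly the data-reduction trade-off described above.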
This is an application of streaming technology. We had a Sony car called VISION-S at a German test course controlled remotely from Tokyo. Visual data from the car was sent so the driver in Tokyo could control it. We tested this as a use for streaming technology, in order to apply it to mobility. Lastly, I have a summary.
How do we at Sony fuse the real and virtual worlds to make a new world? How do we deliver entertainment or create new services? What do we do with cloud computing, and what do we do with client devices? Where do we process what, and what do we send? I think we need to consider these combinations in the future. In particular, we can use edge servers to process data closer to the user and return this processing at extremely low latency. A big challenge for us is how to combine computing resources and bring about distributed computing.
Unfortunately, we cannot find a system that can achieve everything. There are factors like latency, cost, and other issues. Looking at these trade-offs, we need to consider our business model and applications, and develop the system accordingly. That is all from me. Thank you. Mr. Nagahama, you brought up systems technologies
like distributed computing platforms, each of which has its own issues. None of them is perfect, but optimizing each one and orchestrating the computing is key to delivering the final service. I would like to start the discussion. Do you have a question, Mr. Mukawa?
As I spoke about before, you need to recreate high resolution images in real-time in HMD. There was also talk of cloud rendering and edge rendering. When do you think you will develop the technology to stream AR for HMD, especially low-latency AR, which is the biggest challenge? That is a tough question. Optical see-through AR is difficult. As you stated, it is a big challenge to render images in milliseconds. I honestly don't know how well we will be able to do it.
We are already working on entering VR virtual spaces. Latency is an issue, but as you spoke about, you can do high resolution rendering only in the center of sight. In the same way, by sending data slightly outside the scope of sight beforehand you can follow it even if you shake your head a little. In VR, I think you can render in the cloud and send the data from there. But AR is more difficult. So we'll work hard.
Together. Thank you. Who's next? Mr. Asada, go ahead. We have heard a lot about video today, but speaking about audio, what sort of plan do you have to address real-time audio? There is less data involved in audio compared to video, so it may appear easy, but I am sure you are aware that it is easier to feel latency with audio.
When we use tools for remote conferencing, we have all experienced lag in conversation. This is latency. This is a difficult issue. General conferencing tools use network protocols to ensure anyone can use them in any environment. Instead, we can establish networks for a limited number of applications, or even have dedicated networks.
We can limit demand for people speaking even as many listen. We can create optimal protocols by limiting the number of people speaking. Such measures can help produce low latency audio. We are currently testing these different options. And as you said, where we previously sent stereo audio, we can now send object-based audio in 360 degrees. I hope we can work together on these efforts.
That's great. Thank you. Thank you.
In our panel discussion, each of you spoke about your area's technology. Lastly, I want to discuss the overall direction. You spoke about how your 3D-3R technology fields would change the world.
From a wider perspective, how will 3D-3R technology change the experience for creators and users? I'll just pick speakers at random. Mr. Nagano. Right. What will change by achieving 3D-3R technology? Many areas will change a lot. In particular, I think remote communication could change a lot.
As Mr. Nagahama spoke about, we all use conferencing tools. But it doesn't really feel like the person is there with you. It can be very stressful to talk. As Mr. Nomoto spoke about at the beginning, the world will change if we can achieve immersiveness, realism, interactivity, and low latency. If we can communicate like the person is there, be it your grandparents or whoever else, it would greatly change the depth of communication.
I think the concept of distance would change a lot. That's right. Remote work increased during COVID and became the new normal. Even as we recovered, remote work has become more common.
And it's very comfortable. It's important to discuss how to make these conversations possible. Mr. Tanaka?
I think my thoughts are the same as Mr. Nagano's, since we work together. I have expectations for content creation. One of my favorite movies is "Star Wars." In meetings like this, some people may be there in the flesh while others may join as holograms, like in "Star Wars." That's one thing I want to do. And that involves capturing space, transmitting it, and then displaying it.
I cannot explain all of the fine details, but we are testing the recreation of full-size 3D holograms in real time from a volumetric studio. It looks as if the person teleports in. It is really moving to see that person appear before you.
I've spoken about this with Mr. Asada. We didn't like how the voice came from a speaker. So we'll use a combination of technologies to make the voice come from the mouth. We can use all of Sony technology to make the person seem really there.
It will be a pleasant experience. Please continue to support us. Thank you. Teleportation is the ultimate form of remote communication. Mr. Yutaka? I have always focused on interactivity. I think interactive technology is a big part of 3D-3R.
This is easy to understand if you've played games before. You experience it, not just watch it. This is because of interactivity. Things happen based on what you do.
That really changes the experience a lot. With games, you can achieve complete interactivity by building a world. Some things may be hard to do in the real world, but with an HMD, simply turning your head changes what you see, instinctively. This is part of interactivity. I think these digital experiences will become more natural. That will be a big change compared to the past.
Now, many things are being gamified. I don't know if e-Sports counts as gamification, but education and tourism are being gamified. Game technology is migrating into many different services. Right. Interactive experiences in education help
make learning more personal. Same for tourism. It is key to see how interactive experiences are being integrated. Your simulation technology is going many places. Mr. Mukawa? I have always focused on AR. With 3D-3R, I think the entertainment field will change a lot.
Take TV, the most immediate form of entertainment to us. We usually enjoy TV in our living spaces. But instead of just the TV signal, we can add data such as haptics and volumetrics. For example, actors could come flying out into your living room. They could be very realistic. Virtual humans, as Mr. Tanaka said. Actors jumping out.
We didn't talk about haptics, but you could shake hands or even smell the person. 2D is good for some things, but expanding beyond the 2D TV, the entire room could become an entertainment space. It will take time, but this would change entertainment and TVs. Multi-modal is another concept: recreating the five senses. The focus today is on video and audio.
But haptics, smell, and taste are options. Haptics means touch. There may be a sixth sense one day. Multi-modal is a big point in enhancing realism.
That is important and will see a lot of coming development. I see. How about you, Mr. Asada? How do you see audio? I have felt how important networks are to capture and send video and audio. As was mentioned in the introduction, I think it is important to really feel like you are there.
What can we do to achieve this? I got a hint at a South by Southwest exhibition. It would record your voice in real-time in a dark room, and add a reverb to give it a sense of space. We're all used to our own voices. While not quite echolocation, we have a sense of our own voices. So we can feel the space when we hear our own voice in it.
We're extremely sensitive to sound, especially human voices. If you add real-time and interactivity into that, you can feel that you are really there. I want to dive deeper into this. There is unknown potential in voice. That's right. Thank you. Lastly, how about you, Mr. Nagahama? Going back to the presentation.
I think communication is very important, as you spoke about. Creating communication where people really feel that they are there. Ultimately, when communicating, there are cases when real people need to be present, but we can use digital 3D data, such as game worlds. Mr. Yutaka spoke about how digital data could be processed. Combining this with communication allows us to use 3D-3R technology not just to replace the real with the digital, but to upgrade the real with the digital in communication. For example, if I raise this hand, it could stretch out like this.
It might be good to add a sense of entertainment. Lastly, we say that Sony values getting closer to people. So it's important to use 3D-3R technology to connect with people in social settings. I spoke about the collaboration with Manchester City FC in my part. The goal was for fan engagement.
We used 3D-3R technology to excite fans, ultimately to build social relationships. I think that is a large field of research for the future. This might sound like mere rhetoric, but a big question is how to use 3D-3R technology to surpass 3D-3R. How can we surpass the real? Surpass the real-time? I think a major challenge will be how to surpass 3D-3R. Thank you. Thank you for coming to this 3D-3R session.
In light of the panel members' messages, I would like to close out this session with a declaration to fill the world with emotion using 3D-3R. Thank you. Thank you.