The "VTuber" and Why Artificial Intelligence has Limits
Let's talk about VTubers. The amygdala comes from the Greek word for almond, and it's the part of the brain associated with recognizing emotion from facial expressions, or at least most of them, probably. It's an effective evolutionary trait for recognizing emotions. In fact, it's so effective that sometimes emotions can be recognized in non-human face havers such as animals or drawings. But determining the emotion from a facial expression is a lot more than just selecting one from a set. There are many different facial structures, and yet the same minor muscular movements can convey the same slight shift in the intensity or nature of an emotion across all of them. And so
the question is: how does the brain recognise and piece together all these seemingly insignificant details to understand what's being communicated by a facial expression? For the past few decades, a popular interpretation of the brain has been the comparison to the computer. Both of them, after all, are things that have parts in them that do things. In conventional computer software, a simple program follows a sequence of logical steps. To find a red dot in an image, for example, you can check every single pixel until you get to a pixel that is red with no red around it, and to find a face and all of its features, you just do the same thing but replace the red dot with a face.
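As a rough sketch and nothing more, that red-dot logic might look something like this in Python. The image format, the thresholds, and the idea that a "dot" is a single isolated pixel are all invented for the sake of the illustration.

```python
# Toy sketch: find a lone red dot in an image by brute-force pixel checking.
# "image" is assumed to be a list of rows of (R, G, B) tuples.

def is_red(pixel, threshold=200):
    r, g, b = pixel
    return r > threshold and g < 80 and b < 80

def find_red_dot(image):
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if is_red(pixel):
                # Check the surrounding pixels so we only accept an isolated dot.
                neighbours = [
                    image[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if not (dy == 0 and dx == 0)
                    and 0 <= y + dy < len(image) and 0 <= x + dx < len(row)
                ]
                if not any(is_red(n) for n in neighbours):
                    return (x, y)
    return None  # no lone red dot found

image = [[(0, 0, 0)] * 5 for _ in range(5)]   # a tiny all-black image
image[2][3] = (255, 0, 0)                      # with one red pixel in it
print(find_red_dot(image))                     # -> (3, 2)
```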
But people are made of many different coloured pixels, and even if we weren't, there's no difference between this pixel and this pixel. So maybe instead we should look at the combinations of pixels - at the edges of the image. But even then there's really very little difference between this edge and this edge, and so maybe instead we should look at the combinations of edges - at the shapes. In 1964, Dr. Woodrow Wilson Bledsoe published a report on a project for recognizing faces called the Facial Recognition Project Report. The goal of the project was to match images of faces to names and
one of the strategies for doing this was to first look at the features of the face. Features can include key points such as the hairline, the corners of the eyes, or the tip of the nose. These features can then be combined, and the distances between the features are used to recognise and classify the face against many faces in a dataset. Of course, not all faces are always directly
facing the camera. Some of them are facing to the left and some of them are facing the consequences of their creation. To correct this, mathematical transformations were applied to the distances to face the face face-forward. Unfortunately, not much else is known about this project due to confidentiality, but it is the first known significant attempt at computers processing human faces - so long as those faces are at a reasonable angle and lighting and age and aren't wearing a hat. Ultimately, due to limitations in technology, the project was labeled as unsuccessful, but it did highlight a problem in processing faces: faces were just too different from each other, and too different from themselves in different settings. Any singular hard-coded algorithm had to be very convoluted, with potentially a lot of error, but that's okay because this was in the 60s, so they still had plenty of time to figure it out before I make this video.
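To get a feel for that distance-comparison idea, here's a toy sketch in Python. It is emphatically not Bledsoe's still-confidential method; the key point names, the two-face "database", and the lack of any normalization are all invented for illustration.

```python
import math

# Toy sketch of distance-based face matching: each face is reduced to a few
# hand-marked key points, the pairwise distances between them become a
# "signature", and a new face is matched to whichever stored face has the
# closest signature.

def signature(keypoints):
    """Pairwise distances between named key points, in a fixed order."""
    names = sorted(keypoints)
    return [
        math.dist(keypoints[a], keypoints[b])
        for i, a in enumerate(names) for b in names[i + 1:]
    ]

def closest_match(unknown, database):
    """Return the name in the database whose signature is nearest."""
    unknown_sig = signature(unknown)
    return min(
        database,
        key=lambda name: sum(
            (u - k) ** 2 for u, k in zip(unknown_sig, signature(database[name]))
        ),
    )

database = {
    "face_a": {"left_eye": (30, 40), "right_eye": (70, 40), "nose_tip": (50, 65)},
    "face_b": {"left_eye": (28, 42), "right_eye": (75, 41), "nose_tip": (52, 70)},
}
unknown = {"left_eye": (29, 41), "right_eye": (71, 40), "nose_tip": (50, 66)}
print(closest_match(unknown, database))   # -> face_a (the closer signature)
```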
VTuber, short for virtual tuber, short for virtual YouTuber, is a YouTuber that is virtual, as opposed to regular YouTubers, who are authentic and genuine. Being YouTubers, they are most commonly found on Twitch and have the potential to be as varied in content as any other content creator. In other words, swaying back and forth with their mouths open. VTubers take obvious inspiration from Vocaloid performances such as those of Hatsune Miku, but unlike Vocaloids there isn't a singular history of VTubers since 1964, as VTubers are not a singular organisation or even a singular technology. Most VTuber historians like to start the history at Kizuna AI though, who is in every sense of the word a virtual YouTuber, and the first virtual YouTuber if you don't count Super Sonico or the Weatheroid or Annoying Orange. Ai is the Japanese word for love
but it also has the obvious second meaning A.I. - the abbreviation for artificial intelligence. It could also stand for Adobe Illustrator but it does not because that is too many puns and also would not make sense. The character proclaims herself to be an artificial intelligence and by artificial intelligence she means animation software with a production crew but how exactly the human operator and voice are mapped to the model has been mostly left to speculation due to confidentiality. It's also noteworthy that unlike most traditional YouTubers she's somewhat of a corporate mascot under Kizuna AI Inc made apparent by her involvement with commercials.
Nonetheless, her character and her format of 3D animated, voice-acted video productions mark the first usage of the word VTuber, and so it's a good place to start. Unfortunately, Kizuna never left behind a formal definition for the word VTuber, and so its definition, at least as a medium, has been left fairly open-ended. If you take the definition to be the substitution of one's identity with an animated character in online video form, then the concept of VTubing is neither very novel nor uncommon. On one hand, you have static expressions chosen from a set of pre-drawn
illustrations, or PNGTubers, and on the other hand you have full-body motion capture devices connected to a model within a game engine, such as that of CodeMiko. These two extremes sit at either end of the spectrum from most affordable but least immersive to most immersive and least affordable, but they both solve the facial-recognition-from-an-image problem in the best way possible: by not solving it. It's much easier to find a dot on a screen than a face, so what if we made people's faces into dots, or rather, multiple dots? In 1990, Lance Williams decided to track faces by
not tracking faces, and published a paper on Performance-Driven Facial Animation in which retroreflective dots that were easy for a computer to detect were applied to a person's face, tracked, and then mapped to a 3D model. Williams performed this on himself for budgetary reasons and not because he wanted to become an anime character. This would be one of the first instances of marker-based facial motion capture for animation: a technique that can be held accountable for The Polar Express. But it's unreliable and bothersome, both of which are bad things, and so it has nothing more to do with this video. If we ignore all the other parts of the body, CodeMiko's facial animation uses the iPhone X's FaceID. By using projectors to project
a light onto the face and then sensing this reflected light using sensors to measure depth, a three-dimensional image is created, thus avoiding the problem of angle. And since the projectors and sensors are projecting and sensing infrared light rather than visible light, on top of the infrared light that your body radiates naturally, the lighting around the face does not affect the image. The entire solution is thus in the hardware, and it works pretty well, even on faces that are wearing hats. However, how exactly a three-dimensional depth map is achieved from a face with lights on it is something that we're not going to get into, partly because hardware is scary but mostly due to confidentiality, though it doesn't take an Apple engineer to make the observation that light patterns distort themselves when reflected on three-dimensional surfaces, which could help indicate the shapes of those surfaces. Apple's
FaceID remains dominant in the IR camera facial mapping market. Google's Pixel 4 had a similar system called uDepth, which used a stereo depth-sensing system, otherwise known as two cameras, similar to how you have two eyes to sense depth, but this was discontinued, and the other one is the Xbox Kinect. All of this wasn't developed just for Apple's primary demographic of VTubers though. The main selling point of FaceID is its biometric authentication system, and also Animoji. But where
VTubing comes in is the tool that Apple provides to developers: the ARKit. Developers can build apps around this tool such as Live Link which feeds the facial data directly into Unreal Engine which is what CodeMiko uses. But what if you can't afford an iPhone X or just despise Apple? Surely there's another way to VTube from your webcam or camera. In fact, it's probably the technology
you've been thinking of since we brought up brains and facial recognition. Microsoft Excel has a tool that allows you to draw a trendline that best represents a scatter plot. Most data is probably not linear, but a line can still be used to predict y values given x values. Of course, this prediction could just be terrible, and so Microsoft Excel has to minimise the distance between every single point and the line to find the line of best fit. This process is called linear regression. Linear means relating to lines and comes from the word line, and regression means estimating the relationship between a dependent variable and many independent variables, and comes from the 19th century bean machine. You may have noticed from that last sentence that there are many independent variables. Linear regression is useful for drawing lines through three-dimensional and four-dimensional and whatever-dimensional scatter plots. Every new dimension is just another variable, or feature, that affects the output we're trying to predict on the y-axis.
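If you're curious, a least-squares fit like Excel's trendline takes only a few lines of NumPy. This is a minimal sketch with invented numbers, not anything Excel actually runs.

```python
import numpy as np

# Minimal sketch of linear regression by least squares: find the weights that
# minimise the squared distance between the predictions and the actual y values.

X = np.array([[1.0, 2.0],       # each row: one example; each column: one feature
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([8.0, 7.0, 18.0, 17.0])   # the values we want to predict (made up)

# Add a column of ones so the fitted line can have an intercept.
X_with_bias = np.hstack([X, np.ones((len(X), 1))])

# Solve for the weights that minimise ||X_with_bias @ w - y||^2.
w, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)

print(w)                   # one weight per feature, plus the intercept
print(X_with_bias @ w)     # predictions for the training examples
```

The same call works no matter how many feature columns you stack in, which is all "whatever-dimensional" really means here.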
Using linear regression to predict how long a person is going to watch through a video, the features may include the length of the video, the age of the person, and how much of the video is about statistical theory. And to make predictions off of, say, images of faces, the features could be every single color value of every individual pixel in the image. But making predictions off of something as advanced as an image of a face may not be as simple as just drawing a line. A linear fit of our line, or hypothesis, to every single feature might not be the most appropriate. It might work better as a quadratic or cubic fit. By adding more features or dimensions that are equal to the square or the cube or the whatever of the previously established features, we can do polynomial regression, which is actually just another type of linear regression, because the hypothesis is linearly proportional to something that is non-linearly proportional to the original data. You can also combine features and make new features with them by multiplying them together: if you have a height feature and a width feature, you can instead have an area feature. But making predictions off of something as advanced as an image of a face may not be as simple as just drawing a multivariate nth-degree polynomial. We know that we can modify and combine features to make new features to optimally fit our hypothesis to the data, but in what way do you modify the features? Which features do you combine? How do you even do any of that for thousands of pictures that have hundreds of thousands of pixels and millions of RGB values? Who is Gawr Gura? A slightly controversial sort of lie is that linear regression, as well as all other types of regression, is a form of artificial intelligence. In fact, if you sort of lie, anything can be a form of artificial intelligence. You yourself at home may already know a good deal about artificial intelligence, either from your own extensive research and experience or your ability to lie, but artificial intelligence isn't so much an algorithm as it is the idea of artificially creating something that is intelligent, or at least seems so.
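Circling back to the feature juggling from a moment ago: squaring, cubing, and multiplying features together before fitting is only a couple of lines on top of the same least-squares sketch from before. Again, all the numbers and the choice of extra features here are invented.

```python
import numpy as np

# Sketch of manual feature engineering: start with two features (height and
# width), add their squares and their product (an "area" feature), then fit
# exactly the same kind of least-squares model as before. The model stays
# linear in its weights even though the new features are non-linear in the
# original data.

heights = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
widths  = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
y = heights * widths + 1.0        # pretend the target really is "area plus one"

features = np.column_stack([
    heights,                      # original feature
    widths,                       # original feature
    heights ** 2,                 # polynomial feature
    widths ** 2,                  # polynomial feature
    heights * widths,             # combined "area" feature
    np.ones_like(heights),        # intercept column
])

w, *_ = np.linalg.lstsq(features, y, rcond=None)
print(np.round(features @ w, 2))  # predictions land on (or very near) y
```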
Most of what you may know to be artificial intelligence is the method of machine learning called the artificial neural network. A neural network, or a network of neurons, is a system of units that receive information, process it, and pass it on to other units in order for the entire system to make a prediction - quite similar to the neurons of a brain. It's also quite similar to Danganronpa in that regard. Neural networks, and all of machine learning, are a big deal because they allow programmers to do things they typically can't do on their own, such as play board games at a grandmaster level or hold a conversation. This is because unlike in conventional programming, where the programmer knows what they're doing, in machine learning the program learns how to do it on its own without the programmer really needing to know how it was done. But machines don't have feelings, so how and what exactly is the machine learning? The units or neurons of a neural network are organized into layers. The first layer or the
input layer is where the inputted features are received. For every inputted feature there is a neuron in the input layer. Each feature within each neuron can then contribute by some weighting to the features within the next layer of neurons. The different weighted sums of all the features of
this layer are thus the information received by the neurons of the next layer. This next layer, called a hidden layer, then applies some processing to the information in order to make it harder to explain. First, it adds a number called the bias value, in case the information is going below a certain threshold that it shouldn't, and then it puts it all through an activation function, which is just some non-linear function, so that the features of this layer are not necessarily linearly related to the features of the previous ones. These newly activated features can then be passed on to the next layer to repeat the process and make more features off of these features. Through this, the features of each layer are like a combination of the features of the previous ones: from pixels to edges to shapes to faces. If there are many, many layers that compute very, very specific or complicated features, then the entire network can be called a deep neural network, because it is very long. Eventually, it reaches an output layer, which has as many neurons as there are things you're trying to predict. The values received here are the predictions that the model is giving based off of the input.
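Stripped of everything else, a layer really is just a weighted sum, a bias, and an activation function. Here's a minimal sketch of a forward pass through a tiny, untrained network; every size and value is made up.

```python
import numpy as np

# Minimal sketch of a forward pass through a tiny neural network: weighted
# sums, plus a bias, pushed through a non-linear activation, layer after layer.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)          # a common activation function

def layer(inputs, weights, bias, activation=relu):
    return activation(weights @ inputs + bias)

x = rng.random(6)                      # 6 input features (e.g. pixel values)

W1, b1 = rng.random((4, 6)), rng.random(4)   # input layer -> hidden layer
W2, b2 = rng.random((2, 4)), rng.random(2)   # hidden layer -> output layer

hidden = layer(x, W1, b1)
output = layer(hidden, W2, b2, activation=lambda z: z)  # linear output layer
print(output)    # the network's (untrained, meaningless) predictions
```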
To train a model is to figure out all of its weights and biases, which are altogether called parameters. This decides how each feature fits into the next layer of features. To do this is just the simple task of finding or creating hundreds of thousands of pieces of data. The input data can be put through the model and the predicted output can be compared with the actual true value that was manually determined by a human.
The function that does this comparison and determines how wrong the model is is called the cost function. We can then go backwards through the model to find out how each parameter can be changed in order to lower this cost function. This part is called backpropagation, and if you know calculus, it's a quick way to calculate the partial derivative of the cost function with respect to every parameter, and if you don't know calculus, well, it's the same thing but you wouldn't understand. The neural network relies on training with many sets of data in order to improve itself with each set, hence the name machine learning. Now admittedly, all of that may have been a bit of an oversimplification, but it's the backbone of machine learning and the model used for computer vision and more specifically object detection, which would be true if I wasn't lying. Different architectures of neural network can have different activation functions, cost functions, numbers of layers, and numbers of neurons per hidden layer.
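Before getting to the convolutional part, here's what that cost-and-update loop can look like for the simplest possible model: a single layer with no activation, where the chain rule is short enough to write out by hand. A sketch with invented data, not a recipe for a real network.

```python
import numpy as np

# Sketch of a training loop: compute predictions, measure how wrong they are
# with a mean-squared-error cost, compute the gradient of that cost with
# respect to the weights (what backpropagation would do layer by layer), and
# nudge the weights downhill.

rng = np.random.default_rng(1)
X = rng.random((100, 3))                  # 100 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                            # "manually labelled" targets

w = np.zeros(3)                           # the parameters we want to learn
learning_rate = 0.1

for step in range(500):
    predictions = X @ w
    cost = np.mean((predictions - y) ** 2)          # how wrong the model is
    gradient = 2 * X.T @ (predictions - y) / len(X) # d(cost)/d(w)
    w -= learning_rate * gradient                   # one small downhill step

print(w)      # should end up close to true_w
print(cost)
```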
The architecture for a computer vision model in which the input is an image matrix is even more convoluted, as it is a convolutional neural network. An RGB image can be thought of as three matrices: one for each of the red, green, and blue values of every pixel. However, it would be a lot of weights to put every single pixel into a weighted sum for every single feature of the next layer. Rather, the more efficient technique devised is to take these matrices of features and pass them through a filter that forms some number of matrices of new features for the next layer. The parameters here for the convolutional layers are the values that make up the filters for each layer. There are then also pooling layers that reduce the number of features by throwing them away and hoping it works, and then near the end of the network we may have some fully connected layers, which are just the same layers as before with weights and biases, to sort of check to see if there are any relationships we're missing between all the features now that there are fewer of them. Finally, the vector of features that we're left
with is put through some regression function to perform the actual classification or localisation or detection or moral dilemma conclusion for your autonomous vehicle. Just like with the basic neural network, the convolutional neural network or ConvNet if you're running out of time is filtering for more and more specific features with each layer. Also, this was once again an absurdly oversimplified oversimplification that is ignoring a lot of the math though this time I'm not lying in any of the information I've given. Computer vision is fairly big and so a lot of research has been put into it as well as a lot of abstractions from its bare mathematical bones to the point where training and then running a model could take the copy and paste skills of web development. Given just a few hundred images, you can train a model to perform object detection on the body parts of Oikawa Nendoroids. Even you can become a VTuber. In the case of VTubers which is what this video is about, it's not actually object detection but rather facial landmark detection. The output for such a model may be a number denoting whether or not there's a face on
the screen, followed by the x and y coordinates of several keypoints along the face, eyebrows, eyes, nose, and lips.
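For anyone wondering what such a model looks like when actually written down, here's a deliberately tiny sketch, assuming PyTorch. The layer sizes, the 68 keypoints, and everything else are arbitrary choices for illustration, not what any particular VTuber software actually ships.

```python
import torch
from torch import nn

# Tiny sketch of a convolutional network for facial landmark detection:
# convolution and pooling layers filter the image down to a feature vector,
# fully connected layers turn that into one "is there a face" score plus
# (x, y) coordinates for each keypoint.

class TinyLandmarkNet(nn.Module):
    def __init__(self, num_keypoints=68):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # RGB image in
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: shrink the feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 28 * 28, 128),                # assumes 112x112 input images
            nn.ReLU(),
            nn.Linear(128, 1 + 2 * num_keypoints),       # face score + (x, y) per keypoint
        )

    def forward(self, image_batch):
        out = self.head(self.features(image_batch))
        face_score = torch.sigmoid(out[:, 0])                       # probability a face is present
        keypoints = out[:, 1:].reshape(-1, self.num_keypoints, 2)   # (x, y) for each keypoint
        return face_score, keypoints

model = TinyLandmarkNet()
dummy_frame = torch.rand(1, 3, 112, 112)          # one fake 112x112 RGB frame
score, points = model(dummy_frame)
print(score.shape, points.shape)                  # torch.Size([1]) torch.Size([1, 68, 2])
```

Training it would use the same cost-and-gradient loop idea from before, just with far more parameters and a labelled dataset of faces.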
You may have noticed that we never answered the question of how a brain detects faces and facial expressions. The answer to that is: who knows? It's not my job to teach you neurology. In fact, I'm unemployed. If you take the definition of VTubers to be "a", then its history is a pretty straightforward series of events. Following the massive success of Kizuna AI in Japan, many other Kizuna-esque VTubers started popping up, such as Kaguya Luna and Mirai Akari. It was only a matter of time before agencies that managed several VTubers started appearing on the scene, such as Hololive. The agency Nijisanji broke from the tradition of 3D models and used Live2D, which is like 3D but with one less D. Rather than a 3D model of joints, meshes, and textures, Live2D takes several flat images and layers them on top of each other, moving them in different ways to give the illusion of depth. Perhaps more important, though, is that Nijisanji focused on live streams rather than video productions, and Hololive and other VTubers soon followed suit. And like all
things Japanese, very soon there were fan English subtitles, followed by official English subtitles, followed by English-speaking branches such as Hololive English producing groups such as HoloMyth, including Gawr Gura, followed by entirely English-speaking agencies such as VShojo. This rise of VTubers from the debut of Kizuna AI to the debut of VShojo spans a relatively short period of time, from 2016 to 2020. In almost all of the examples I've given thus far though, the technology used for facial tracking is not artificial intelligence. No matter how efficient or accurate a neural network may be, it has one fatal flaw in that it is software. Having to put
every frame or every few frames through the neural network to get updated coordinates for our model is a lot of processing. Even with a convolutional neural network that does everything it can in every layer to reduce the number of parameters, image processing is going to be a costly process. This means that in order for the animation to work in real time with what's commercially available today, the smoothness or precision is going to have to be significantly reduced. Add on to the
fact that computer vision is very dependent on lighting - you can't process something from nothing - and it makes sense why both Hololive and Nijisanji provide iPhones to all their incoming VTubers. The TrueDepth system of Apple's FaceID still uses software, but the hardware is designed specifically for the purpose of facial mapping. This means that rather than being given some massive dataset and then finding the features that it figured out how to find on its own, the program is given the features of light distortion, or depth, which coincide directly with the coordinates of the facial landmarks using just some conventionally programmed geometric operations. As funny as it would have been though, it's not like all that talk about machine learning was completely irrelevant. There is still an abundance of VTuber applications for webcams
using ConvNets, primarily targeted towards independent YouTubers who don't get free iPhones. Luppet, Wakaru, VSeeFace, 3tene (which comes with bodily contortions), and FaceRig (which comes with not being good), to name a few. VTube Studio, which is for Live2D, is available for webcam, Android, and iOS. For webcam, it uses a model from OpenSeeFace. There it is. On Android it uses ARCore, and both are deemed to offer lower-quality tracking than the iOS version. VTubing is not just facial tracking though, but since it's all about tracking a human and mapping it to an avatar, all the other aspects of VTubing short of a mocap suit can use similar technologies.
Hand-trackers such as LeapMotion use IR projectors and sensors to track hand motions, which is very handy but also limited, because you can't cross your arms, so no snarky remarks. Natural language processing problems such as speech recognition would require a lot of manual feature engineering, and so neural networks are preferred, inputting human speech and outputting text, which can then either be used as another way to do mouth tracking or be synthesized back into speech via more neural networks to mask your voice, like VShojo's Zentreya. "Head fat". Neural networks, light sensors, and even mocap suits, VR headsets, eye-trackers, and the Xbox Kinect are all methods of motion capture. And if I didn't want people to see this video, I could probably title it motion capture, at least for up to this point. But that still wouldn't be entirely true, as the motion capture required for VTubing is still different from that required for general virtual reality or film production. There is an emphasis in the technology on facial
expressions, affordability, and presenting to an audience in real-time. What's more is that VTubing doesn't have to be and was never meant to be just a one-to-one direct transfer of motion to an avatar. While this goes more into design than development, VTubers can also employ things like object interactability, keyboard shortcuts for pre-programmed animations, or additional physics.
VTubing is like a virtual puppet show, or Luppet show you could say, or not say, actually, and just because the puppet strings of motion capture are necessary doesn't mean you can't improve the show with piles of corpses. Maybe it shouldn't even be a puppet show. Perhaps the future of VTubing should be a looser connection to the puppeteer for more expressive or stylistic animation. A paper was written last year, in 2021, for a VTubing software called AlterEcho. The software uses facial expression recognition, acoustic analysis from speech recognition, and mouse or keyboard shortcuts to apply gestures to an avatar on top of motion capture - gestures that the human themselves is not actually doing. The nature or mannerisms of these gestures can be configured by what the paper calls avatar persona parameters, such as how shy or confident the VTuber persona is supposed to be. How effective this all is is still unknown though, as the software is unavailable and
the paper is still under double-blind review, at least at the time of this recording, though the paper itself states that it was rated fairly highly compared to pure motion capture and VMagicMirror, which is a keyboard-based software. On the topic of new software for independent VTubers, while 2D models are modeled with Live2D, 3D models are not necessarily modeled with actual 3D modeling software like Blender, but rather with software like VRoid Studio, which is essentially a character customization screen with many sliders for incredibly unique customization, though the official stable release has only been out for a year. Currently, both 2D and 3D VTubers suffer from a noticeably homogeneous design and style that some say is reminiscent of Genshin Impact characters, whereas others argue it is closer to Honkai Impact characters. Perhaps a much more unique, easily accessible VTuber avatar creator will be available for the next generation of VTubers. It's unlikely that it will ever break out of that anime niche anytime soon. You had your chance. And it's definitely not going to be whatever the Metaverse was supposed to be. But just like how Live2D models have been getting exceedingly creative, 3D models could branch off into as many styles as there are styles of anime, which has the opportunity to be aided by a movement towards more motion-capture-independent, animation-focused VTuber software. In regard
to the future of VTubing, there is another possibility that has been somewhat disregarded since the decline of Kizuna AI, and it has to do with the AI part of Kizuna AI. Not love. It's quite common for VTubers nowadays to come with their own lore - the avatar is actually a mythical beast or a magical being. Kizuna, the self-proclaimed first virtual YouTuber, of course had the backstory of being an artificial intelligence. This whole backstory stems from the original idea behind artificial intelligence: to mimic human intelligence. A neural network can learn how to find the features it needs to find to locate a facial landmark on an image, which could imply that, given the right conditions in training, it can learn the features of human behavior and produce content. And while in the case of Kizuna, artificial intelligence was only used at most for
landmark detection, there already exist machine learning models that write scripts, interact with humans, play games, and generate animations. There are even neural networks for singing synthesizers, such as SynthV, which has been mentioned to me by all five of its users. It seems not too far-fetched to just combine all of these to create a truly automated artificial intelligence virtual YouTuber. However, we also know that all of these are just several independent abstract networks. The learning that these networks are doing isn't based off of experience or intuition
or arguably even any logical structure. It is just a collection of shifting parameters and mathematical evaluations that knows something is correct because, in the case of supervised learning, we informed it that it was correct. A content creator about as sentient as the trendline generator on Microsoft Excel. We know this even without fully understanding sentience because we know what sentience is not and what it is not is a pattern recognition machine in a computational void. The actual algorithms of human intelligence may still be unknown but because of the way that machine learning was developed, artificial intelligence isn't intelligent in the same way that humans are but it can learn the features of human intelligence and reproduce it to an incredible degree. It's no longer uncommon for humans to be deceived by AI generated conversations or art or music, though perhaps given the nature of parasocial relationships and corporate media, we simply wouldn't mind. Ultimately, whether a content creator who is not
sentient - who we know is not sentient - but can imitate sentience perfectly will be in or against our best interests will be up to us as living breathing humans. Until then though, there are at least a few more Hololive generations before we have to make such a decision so we might as well enjoy what A.I. has to offer and look forward to the future of this still young VTuber era without having to really worry about any unforeseeable threats of other intelligences. "Yahoo"