Bringing Generative AI to Life with NVIDIA Jetson


Hello everyone, welcome, and thank you for joining us today. I'm Dusty from NVIDIA, and I'm really excited to be here and share some amazing advances with you for bringing the latest generative AI models and LLMs to the edge. Since the introduction of transformers, and then the Ampere A100 GPU in 2020, we're all aware of the hockey-stick growth that's occurred in the size of these models and in their intelligence, which feels like it's approaching that of a human.

It's been a huge leap forward in a relatively short period of time, and things have really seemed to hit warp speed with the open-sourcing of cornerstone foundation models like Llama and Llama 2. There's a huge community out there of researchers, developers, artists, and hobbyists, all working furiously on this day and night. It's incredibly exciting, and the pace of innovation is hard, if not impossible, to keep up with.

There's a whole world out there to dive into and discover, and not just LLMs, but vision language models and multimodality, and zero-shot vision transformers alone are a complete game-changer for computer vision. Thank you to everyone out there who's contributed to this field in some way or another. Many have been at it for years or decades, and thanks to them, it feels like the future of AI just kind of, poof, arrived overnight.

Well, the time is now for us to seize the moment, bring this out into the world, and do some good. Naturally, due to the extraordinary compute and memory requirements, running these huge models on consumer-grade hardware is fraught with challenges, and understandably, there's been comparatively little focus on deploying LLMs and generative models locally, outside of the cloud, and even into the field in embedded devices. Yet in the face of it all, there's a growing cohort of people doing just that. Shout-out to r/LocalLLaMA and r/StableDiffusion. At the edge, Jetson makes the perfect platform for this, because it has up to 64GB of unified memory on Jetson AGX Orin, with 2048 CUDA cores and up to 275 TOPS of performance in a small, power-efficient form factor. Why bother, though?

Why not just do it all in the cloud? Well, the reasons are the same as they've always been with edge computing: latency, bandwidth, privacy, security, and availability. One of the most impactful areas, and one that underlies the other applications shown here, is human-machine interaction, or the ability to converse naturally and have the robot autonomously complete tasks for you. As we'll see, you really need to be geared for latency, especially when real-time audio and vision are involved, and of course for anything that's safety-critical. It would also just seem good to know how to run this stuff yourself while keeping all your data local.

And fortunately, there's a massive compute stack openly available for doing just that. We've been hard at work for a while now on JetPack 6, and it's easily the biggest upgrade that we've done to the Jetson platform software architecture, which is now underpinned by an upstream Linux kernel and an OS distro that you can choose. We've also decoupled the version of CUDA from the underlying L4T BSP, so that you can freely install different versions of CUDA, cuDNN, and TensorRT too. We provide optimized builds and containers for a large number of machine learning frameworks like PyTorch and TensorFlow, and now all the LLM and ViT libraries too. There are world-class pre-trained models available to download from TAO, NGC, and Hugging Face that can all be run with JetPack on edge devices like Jetson with unmatched performance. We're bringing more components and services from Metropolis to Jetson for video analytics with DeepStream, and we just released Isaac ROS 2.0 with highly optimized vision GEMs,

including SLAM and zero-copy transport between ROS nodes, called NITROS, for autonomous robots. JetPack 6 will be out later this month, supports Jetson Orin devices, and going forward should make it much easier to upgrade in the future. With that, let's dig into the actual models that we're going to show you how to run today. First up are open-vocabulary vision transformers like CLIP, OWL-ViT, and SAM that can detect and segment practically anything that you prompt them for using natural language.

Then LLMs, followed by VLMs, or vision language models, multimodal agents, and vector databases for giving them a long-term memory and the ability to be grounded with real-time data. Finally, streaming speech recognition and synthesis to tie it all together. All of this we're going to run on board Jetson Orin.

So we've optimized several critical ViTs with TensorRT to provide real-time performance on Jetson. These have higher accuracy, they're zero-shot, and they're open-vocabulary, meaning they can be prompted with natural language expressing context and aren't limited to a pre-trained number of object classes. CLIP is a foundational multimodal text-and-image embedding model that allows you to easily compare the two once they're encoded and predict the closest matches. For example, you can supply an image along with a set of complex labels and it'll tell you which label is the most similar contextually, without needing further training on object classes, meaning it's zero-shot. CLIP has been broadly adopted as an encoder or backbone among more complex ViTs and vision language models, and it generates the embeddings for similarity search in vector databases. Likewise, OWL-ViT and SAM, or Segment Anything, also use CLIP underneath. OWL-ViT is for detection,

whereas SAM is for segmentation. And then over here, EfficientViT is an optimized ViT backbone that can be applied to any of these to provide further acceleration. Again, these are using TensorRT, and they're available today in JetPack 5 and of course will be in JetPack 6 as well. So, let's dig into some of these. We have a demo video here showing the capabilities of OWL-ViT. You can see you just type in what you want to detect, and it will start producing those bounding boxes should it find them. Previously, you would have had to capture your own dataset, annotate it, and train a model like SSD-MobileNet or YOLO

on your training dataset, and it would have a limited number of object classes in it. Well, OWL-ViT is based on CLIP, which was trained on a huge number of images and different objects, so you can query it for practically anything here, and it's a real game-changer not to have to train your own models for each and every last detection scenario that you want to do. This is a really impressive technology to be able to deploy in real time on Jetson, getting up to 95 frames per second on AGX Orin. Not only that, but when you combine the detections from OWL-ViT with a secondary CLIP classifier, you can do further semantic analysis on each object. So here you see, within each detection ROI, it's doing further sub-detections. Brackets mean you want it to use OWL-ViT, and parentheses, which we'll see in a second, mean you want it to use CLIP.

So this is not dissimilar to the primary/secondary detection pipelines that we've done in the past, but in this case, it's all zero-shot, open-vocabulary, and much more expressive. You can see here not only is it detecting these different objects, but it can also classify them individually: happy face, sad face, other things like that. And you can perform some very powerful detections and sub-classifications this way with just writing a simple query on your video stream. No code required, even. So all the code for running OWL-ViT in real time is available on GitHub, in what we call the NanoOWL project; to give a feel for that prompt-in, boxes-out style of workflow, a rough sketch follows below.
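To make that concrete, here's a minimal sketch of open-vocabulary detection using the upstream Hugging Face OWL-ViT implementation. NanoOWL wraps a TensorRT-accelerated version of the same model, so its exact class and function names differ, and the model ID, image file, and prompts below are just illustrative.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Open-vocabulary detection: the "classes" are just free-form text prompts.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32").eval()

image = Image.open("street.jpg")                       # illustrative input image
prompts = [["a person", "a dog", "a delivery truck"]]  # anything you can describe

inputs = processor(text=prompts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])        # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{prompts[0][int(label)]}: {score.item():.2f} at "
          f"{[round(v, 1) for v in box.tolist()]}")
```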

NanoOWL takes the original OWL-ViT models and optimizes them with TensorRT, which is how we took it from, I think, 0.9 frames per second up to 95. And there are various backbones and variants of OWL-ViT that you can run that trade off accuracy against performance. It also has a very simple Python API where you just put in your image and your text prompts, and it spits out the bounding boxes and their classifications. So I highly recommend that you check out this project on GitHub, if nothing else that you take away today, because object detection is still by far the most popular computer vision algorithm run, and this can completely revolutionize it. The segmentation analog of this is called SAM, or Segment Anything Model, and it works very similarly. Basically, you just provide some control points, clicking on the different parts of the image that you want to segment, and it will automatically segment all of those different blobs for you, no matter what they are; a minimal sketch of that click-to-segment flow is below.
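For reference, here's a minimal sketch of that click-to-segment flow using Meta's segment-anything Python package; the checkpoint path, image file, and click coordinates are placeholders, and the Jetson containers ship their own optimized builds.

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (path is a placeholder -- download one from the SAM repo).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("scene.jpg").convert("RGB"))
predictor.set_image(image)                    # computes the image embedding once

# "Click" on the object you want segmented: one foreground point (label 1).
point_coords = np.array([[480, 320]])         # (x, y) pixel coordinates, illustrative
point_labels = np.array([1])

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,                    # returns a few candidate masks
)
print(masks.shape, scores)                    # e.g. (3, H, W) boolean masks
```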

It used to be that you would have to manually go and make a segmentation dataset and train a model on it, and those segmentation datasets were very resource-intensive to annotate. But now you can click on practically anything and it'll segment it for you. And when you combine it with tracking, there's another project called TAM, or Track Anything, that can run Segment Anything over video for you. We also have containers for this available on

Jetson that you can take and run today. So everything that I'm going to cover in today's presentation is available online at what we call the Jetson AI Lab, which is a set of tutorials that we've put together that can easily guide you through running LLMs, ViTs, VLMs, vector databases, the whole thing. And all these tutorials are backed by pre-built Docker container images that we have for all these packages. You can take those containers, combine them, mash them together, build your own custom versions, and build your own applications with them and deploy them that way. So it takes a lot of the guesswork and debugging out of porting what are normally GitHub projects built for x86 and discrete GPUs over to ARM and aarch64, and you don't have to worry about patching all of that or getting the Transformers library to run, and so on.

And we have a lot of different packages here available at your disposal, and some really awesome tutorials to follow. You can chat with your own Llama models. You can run LLaVA and all the ViTs that we cover here. And also Stable Diffusion too; Stable Diffusion is really a lot of fun to do image generation with. Okay, so now we're going to jump into the large language models. Here is a chart of the LLM performance that we currently have on Jetson. This is for AGX Orin, and it shows a bunch of

different APIs that we've exhaustively built, tried, and optimized with CUDA, and it turns out the fastest ones are currently AWQ, out of the Song Han lab at MIT, and the MLC/TVM project out of Carnegie Mellon. These two are by far the fastest. llama.cpp is also very popular in the community, and we use that too, but MLC is about twice as fast, and you can see here we can run everything from Llama 2 7B up to the full 70B at interactive rates. We just showed the Llama models here, but you can pick basically anything from Hugging Face that you want to run, run it through these, and do quantization. These are all running with 4-bit quantization, so they're much more memory-efficient, which is how we get them to fit in the memory capacity, because the weights alone for 70B can be up to, you know, 140 gigabytes if you're doing FP16. But if you 4-bit quantize it,

you're down to around 35 gigabytes or so. So here's a video of me chatting with Llama 2 70B. Hi Llama, how are you today? I'm doing well. Thanks for asking. How about you? Is there anything you'd like to chat about or ask? I'm here to help with anything. I actually gave you a voice with text-to-speech and automatic speech recognition, so we are actually talking to one another. Wow. That's amazing.

I'm glad to hear that we can  communicate in a more natural way.  How's the voice synthesis and speech  recognition working for you so far?  Is there anything you'd like to chat about? Oh, it's working really well.  You're running here on an  NVIDIA Jetson embedded board?  That's interesting. NVIDIA Jetson boards  

are powerful and efficient, making them a great choice for edge AI applications. Are you using any specific software or frameworks to run the AI models on the Jetson board? And how are you finding the performance of the board so far? That's funny that you already know about it. Yes, you are running on the GPU in real time. That's impressive. So this was using Riva for the ASR and TTS, and it runs in real time on Jetson AGX. As we'll see in some later videos, the smaller models generate text way faster. This is just a lightweight web GUI that I put together with WebSockets, geared for low latency. It takes the mic from your web browser,

transmits it to your Jetson locally over WebSockets, and then sends the TTS audio back to it. So obviously, being at the edge, most of these Jetson-based embedded systems have cameras or other vision sensors attached to them. And as such, everybody in the community is very interested in vision language models like LLaVA, MiniGPT-4, and a lot of these multimodal embeddings that are coming out. How they all work, essentially, is that they use an embedding model like CLIP that combines text and images into one common embedding space where, contextually, similar concepts land close together.

So if you have a picture of a dog and the word dog, in this multidimensional embedding space those two vectors end up in very similar locations, i.e. they convey the same thought or sentiment to the LLM. And then after that embedding is complete, in the case of LLaVA, it uses literally the same CLIP ViT encoder that we mentioned previously. There is a small projection layer that maps from the CLIP embedding dimensions into the Llama embedding dimensions.

And then they also fine-tune the Llama model to be able to better understand those combined embeddings. What we found is that if you use the larger CLIP model, which uses 336x336 resolution instead of 224x224, it's able to extract much smaller details and even read text out of images. And there are lots of other image embeddings out there; it's a very active area of development. ImageBind, for example, combines way more than just images and text. That can do audio, inertial measurement units,

point clouds, all types of modalities that the LLMs can be trained on and interact with. Essentially, what we're doing is giving the LLMs all of the different senses, so they're able to assimilate a holistic worldview and a perceptual world model, and so they can better understand things rather than having to do it all through text. You can see here that the performance of the LLaVA models is very similar to that of the base Llama models; it is actually the exact same model architecture. It's just a smidge slower, a few tokens per second slower, because it turns out that these image embeddings take up a lot of tokens, like 512 tokens for a 336x336 image embedding, or I think it's 256 tokens for a 224x224 embedding. And all of those

tokens come at the beginning of the text chat. So your text input is at least 512 tokens long, as opposed to a normal blank text chat starting at zero tokens. That's why it's just slightly slower, but otherwise it gets all the same performance. Now, the latest LLaVA 1.5 model that came out, which some of you might have seen, has some

really exciting new features in it, including the ability to output constrained JSON format. So you can essentially tell it to detect certain objects, or that you want it to pick out this and that and give it to you in JSON, so you're able to parse it programmatically and actually do things with it. Whereas with the ViTs you needed to prompt for specific things, like detecting a face or hands or whatnot, in this example you can just tell it to detect everything and it will spit out bounding boxes of all the objects, or whatever it is that you're trying to do. And this will be very powerful for making closed-loop vision applications that are deployed in real-time embedded systems, for example smart traffic-intersection crosswalk monitors, or blind-assistance devices, or baby-crib monitors, or anything that is an open-ended problem where you want to extract information without having to train all of your own models to do that. In addition, there are things like autonomous navigation, being able to query it on, hey, where does this path go? Are there obstacles in my way? Are those obstacles dynamic? All types of things like that. A rough sketch of prompting one of these VLMs is shown below.
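As a rough illustration, and not the MLC/AWQ-optimized pipeline used in these demos, here's what prompting a LLaVA-1.5-style model looks like through the Hugging Face transformers API. The model ID, prompt template, and image file are assumptions based on the publicly available llava-hf conversions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"          # assumed community conversion of LLaVA 1.5
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda")

image = Image.open("fruit_stand.jpg")          # illustrative input frame
prompt = "USER: <image>\nWhat kinds of fruit do you see? List them as JSON. ASSISTANT:"

# The processor interleaves the image embedding tokens with the text prompt.
inputs = processor(text=prompt, images=image,
                   return_tensors="pt").to("cuda", torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```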

And this is really exciting. Another point worth making is that this whole model, LLaVA 1.5, is a huge improvement over the previous versions of it, but it was still trained in only one day on eight A100 GPUs. So if you collect your own dataset, it's actually within the realm of possibility to fine-tune one of these yourselves for your own applications.

So here's a video demo of chatting with LLaVA 1.5 on AGX Orin. What kind of fruits do you see? In the image, I see a variety of fruits, including oranges, bananas, grapes, and apples. How much do the peaches cost? The peaches cost 2.50. How much do the oranges cost? The oranges cost 1.50. What breed of dogs are in the picture? There are two dogs in the picture. One is a husky and the other is a

golden retriever. Where are they?  The dogs are sitting on a rocky  hillside, likely in a mountainous area.  What does this sign say? The sign says, no outlet private road.  What kind of trees are these? The trees in the image are pine trees. 

Where's the path go? The path is in the woods,   surrounded by trees and vegetation. What kind of bird is this?  The bird in the image is a robin. Who has the ball?  The boy in the black and red shirt has the ball. What game are they playing?  They are playing soccer. Can you describe the image? 

The image features a young girl  riding a brown horse in a corral.  The girl is wearing a red shirt and  is positioned on the horse's back.  The horse is galloping around the corral,  showcasing its agility and strength.  The scene captures the excitement and joy  of the girl's horse riding experience.  Here's a look at some of the example  code that powers these demos. 

And it's basically a lightweight wrapper that I put around MLC and AWQ, since those weren't supported in some of the other LLM libraries out there, in addition to all the multimodal embedding management, which we'll talk about. But it's got a very simple API for text generation. Essentially, you load the model, and it will quantize it for you if that's not already done. You create this chat history stack, and then you can append either text prompts or images to it; a rough sketch of that load-append-stream flow is below.
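The wrapper itself sits on top of MLC and AWQ, but to illustrate the same load, append, and stream pattern with a library most people already have, here's a hedged sketch using Hugging Face transformers and its TextIteratorStreamer. The model name and prompt are illustrative, and this skips the 4-bit quantization step.

```python
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # assumes you have access to the weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda")

# The "chat history stack": keep appending user and bot turns to it.
chat = [{"role": "user", "content": "Why is Jetson a good fit for edge LLMs?"}]
input_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Stream tokens out as they are generated instead of waiting for the full reply.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate,
       kwargs=dict(input_ids=input_ids, max_new_tokens=256, streamer=streamer)).start()

reply = ""
for chunk in streamer:                         # arrives incrementally, low latency
    print(chunk, end="", flush=True)
    reply += chunk

chat.append({"role": "assistant", "content": reply})   # keep the history stack going
```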

It will automatically perform the embeddings for you, depending on what data type those inputs are, and then it generates a stream of output tokens. Everything we do here is geared for real-time streaming, so you can get that data presented to the user as soon as possible. And then you basically just append the bot response to the chat history too. Now, let's say you want to do your own chat management and dialogue; you can totally do that. You can just pass in strings of text to

the model generation function. What you should be aware of is that the chat histories work best when you keep a cohesive stack going, because then you don't have to go back and constantly regenerate every single token in the chat. For example, we know the Llama 2 models have a max context length of 4096 tokens. But if you were to re-process the full 4096-token chat every time, it would take a really long time.

Instead, you can keep that cached in what's called the KV cache. And if you do that between requests, then the whole state of that chat is kept. You can run multiple chats simultaneously, but it's highly recommended to keep the chat stack flowing, as opposed to going back and forth and rebuilding it all from scratch every time. And the reason for that is because

there are two stages to LLM text generation. The first is what's called prefill, where it takes your input context and essentially has to do a forward pass over every token in there. This is a lot faster per token than the generation stage, but it still adds up when you're talking about a full 4096 tokens. So you can see, if we're running Llama 70B on a full 4096-token chat, it'll take 40 seconds to prefill that whole thing. That's before it even starts responding. But if you only prefill the latest input, you're looking at a fraction of a second. That's typically the "dot dot dot" that shows up in the chat, or the "agent is typing" indicator.

What it's actually doing is prefilling your  input into it before it can start generation.  So this is why managing the KV cache between  requests is actually crucially important   to keep a very consistent chat flow going. Likewise, here's a look at the token generation   speed and how that varies with the input length. That does decrease slightly as well once  

you get up into the higher token lengths, so that's something to account for as well. Obviously, a big concern with all of this is the memory utilization requirements, and that, along with the token generation speed, is what really drives everyone so heavily towards quantization methods. A lot of these LLM APIs that we talked about, like llama.cpp and others, AutoGPTQ, exllama, have lots of different quantization methods. You can go anywhere from two bits to eight bits. Below 4 bits you start to see degradation in quality, but at 4-bit weight, 16-bit activation (W4A16) quantization I've not really seen any difference in output, which is really good, because it takes Llama 2 70B from around 130 gigabytes of memory usage down to only 32 gigs, and that is much more deployable for Jetson, and likewise for the smaller Jetsons too. You can run Llama 2 7B on an Orin Nano 8GB board, or Llama 2 13B on the Orin NX 16GB; the rough math behind those sizes is sketched below.
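Here's the back-of-envelope math behind those numbers, as a quick sketch; it counts weights only, so the KV cache and runtime overhead come on top of it.

```python
# Rough weight-memory estimate: parameters x bits-per-weight / 8, ignoring
# the KV cache, activations, and runtime overhead -- illustrative, not measured.
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params in (7, 13, 70):
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"Llama 2 {params:>2}B: ~{fp16:5.1f} GB in FP16, ~{int4:4.1f} GB at 4-bit")

# 70B works out to ~140 GB in FP16 vs ~35 GB at 4-bit, which is what makes it
# fit within the 64 GB of unified memory on Jetson AGX Orin.
```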

And we can see here that we have a whole lineup of different Jetson modules that you can deploy, and each of these, conveniently, has a typical model size that is well-fitted to its memory capacity. So as I mentioned, the 7B models

are a good fit for the Orin Nano 8GB, the 13B models are a good fit for the Orin NX 16GB, and so on. So you can basically mix and match the level of intelligence that your application requires, along with its performance and other SWaP-C requirements, like the size, weight, power, and cost of the embedded system that you're deploying, and pick the Jetson module that's appropriate for you. So, a few slides ago I showed some code that was basically a low-level API for doing text generation with the LLMs. Once you start getting more complicated and adding things like ASR and TTS and retrieval-augmented generation and all these plugins, the pipelines get very complex.

And if you're only making one bespoke application, you can absolutely code all of that in one Python application yourself. In fact, the first version of that llamaspeak demo I did was just like that. But eventually, when you start iterating and making different versions of it, say you want one that's multimodal, or you want to do a closed-loop visual agent, you're going to start having a lot of boilerplate code in there. So I've written a slightly higher-level API on top

It's meant to get out of your way and make this  all easier instead of harder without sacrificing   even one token per second of generation speed. And with this, you can very easily chain together   all of these different text and image processing  methods and use them with other APIs as well.  So these are just two basic examples of  what the pipeline definitions look like,   and it can be a completely directed, open-ended,  multi-input, multi-output graph, and get some   pretty complicated setups going in there. So another cool aspect that this enables is   what we refer to as inline plugins, or  the ability for the LLM to dynamically   generate code for APIs that you define to it. For example, how is it supposed to know what   the time is, or do internet search queries, or  what the weather is, or perform actions like   turn the robot left or right? All of those core platform   functions you can define in the system prompt and   explain the APIs to the LLM, and when it  needs to, it'll dynamically call that.  This is good to do on top of retrieval augmented  generation, which we'll talk about in a second,   because it doesn't just retrieve  based on the user's previous input. 

It can do so as it's generating the output and insert the result into the output as it goes. That's one benefit of maintaining lower-level access to the APIs, as opposed to just going back to the cloud for everything: you need the ability to stop token generation, run the plugin, insert the result into the output, and then continue the bot's output, in addition to doing things like token healing or implementing guardrails and guidance functions. For all of that, it's very good to have granular access to the token generation, so you can stop it right when you need to and then restart it completely asynchronously, without destroying all of the low-latency pipelining that you have. This is just a basic example from when I was toying around with this to see if it actually could generate these calls, and this was just Llama 2 7B using its baked-in code generation abilities; this is basic stuff for it. I do recommend using JSON format, even though it's more verbose and results in more tokens, especially if you have functions that take multiple parameters, because JSON allows it to keep the order of the parameters straight so it doesn't confuse them. A minimal sketch of that kind of JSON plugin dispatch follows below.
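Here's a minimal sketch of that kind of JSON plugin dispatch; the function name, JSON schema, and plugin table are made up for illustration and aren't the exact format used in the demos.

```python
import json
from datetime import datetime, timezone

# Hypothetical plugin table -- in practice these functions and their JSON
# signatures are described to the LLM in the system prompt.
def get_time(tz: str = "UTC") -> str:
    # Stub: always returns UTC; time-zone handling omitted for brevity.
    return datetime.now(timezone.utc).isoformat()

PLUGINS = {"get_time": get_time}

def maybe_run_plugin(llm_output: str):
    """If the model emitted something like
    {"function": "get_time", "args": {"tz": "UTC"}}, pause generation,
    run the plugin, and return its result to splice back into the bot output."""
    try:
        call = json.loads(llm_output.strip())
    except json.JSONDecodeError:
        return None                       # ordinary text, not a plugin call
    func = PLUGINS.get(call.get("function", ""))
    if func is None:
        return None
    return func(**call.get("args", {}))

print(maybe_run_plugin('{"function": "get_time", "args": {}}'))
```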

If you don't really have parameters, or they're very simple like the example shown here, you can just do a simple Python-style call. And there are other open-source plugin and agent frameworks that you can use, like LangChain, Hugging Face Agents, and Microsoft JARVIS, that do this as well. They're not quite as geared for low latency and low overhead, which is why I've gone and done this, but at the same time they can do very similar things. In fact, the Hugging Face Agents API has the LLM generate Python code directly, which it then runs in a limited, sandboxed interpreter. So it's able to interface directly with Python APIs, which is really cool and not hard to do.

In this case, I prefer to keep it JSON or text, manually parse that, and call the plugins, so it doesn't have full access to Python. So, I've mentioned retrieval-augmented generation a few times, and that is big not only in enterprise applications, where you want to index a huge number of documents or PDFs and have the LLM query against those. Remember, the context length maxes out at 4096 with Llama 2, and there are rotary position embedding extensions that go up to 16K or 32K or even more, but there's always a limit. And you might have hundreds of thousands of pages of documentation that you want to index against. And when we start talking about multimodal, you can have huge databases of thousands or millions of images and videos that you want to index against.

And not all of that can be included in the context. So what happens, basically, is you take the user's input query and search your vector database with it. It uses very similar technology to the LLMs and the ViT encodings; in fact, CLIP embeddings are used in a demo that I'll show you next. Essentially it uses similarity search to determine which items in the database are closest to your query, and it's a very similar concept to how the multimodal embedding spaces work. And there are some very fast libraries out there

for this, called FAISS and RAPIDS RAFT, that are able to index billions of records and retrieve them lightning-fast based on your queries. Those are very good libraries to use, and I've used them on Jetson here to do a multimodal image-search vector database. I basically made this demo to prove out the abilities of the CLIP transformer encoder, and to understand what I could actually query for retrieval-augmented generation before integrating that into the LLMs. So you can see here, you can not only query with text, but you can do image searches as well; a minimal sketch of that kind of CLIP-plus-FAISS similarity search follows below.
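Here's a minimal sketch of that kind of CLIP-embedding similarity search using FAISS; the model choice, image files, and flat index type are illustrative rather than the demo's exact configuration.

```python
import faiss
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14-336"     # the larger 336x336 CLIP
model = CLIPModel.from_pretrained(model_id).eval()
proc = CLIPProcessor.from_pretrained(model_id)

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    with torch.no_grad():
        feats = model.get_image_features(**proc(images=images, return_tensors="pt"))
    feats = feats.cpu().numpy().astype("float32")
    faiss.normalize_L2(feats)                      # cosine similarity via inner product
    return feats

# Build the index (a flat index here; huge databases would use an ANN index).
index = faiss.IndexFlatIP(model.config.projection_dim)   # 768-dim for CLIP-Large
index.add(embed_images(["dog.jpg", "beach.jpg", "kitchen.jpg"]))   # placeholder files

# Query with text -- the same index also accepts image-feature queries.
with torch.no_grad():
    q = model.get_text_features(**proc(text=["a dog playing outside"],
                                       return_tensors="pt", padding=True))
q = q.cpu().numpy().astype("float32")
faiss.normalize_L2(q)
scores, ids = index.search(q, 3)                   # top-3 closest images
print(scores, ids)
```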

And it's pretty advanced image search, and it's completely real-time, with a real-time refresh here running on Jetson. This is indexed on the MS COCO dataset, 275K images from MS COCO. That took about, I think it was, five or six hours to index the whole thing. But the actual retrieval search only takes on the

order of 10 or 20 milliseconds, which means it's not going to add lag to your LLM generation pipeline. That's very important, because we don't really want more than a couple of seconds of lag between the user's query and the response, especially if it's an ongoing verbal conversation. So here's a chart of the retrieval-augmented generation search time, based on how many items are in your database. And some of these databases can get really huge, especially with corporate documents, things like that.

At the edge, I think it'll be smaller because you  only have so much space available on the device.  But you can see here, it's just on the order  of milliseconds for most people's applications.  And I also break out different  embedding dimensions here too. 

So some of the higher-end embeddings, like ImageBind, represent every single image or text as a 1024-element vector that describes it in this multidimensional embedding space; CLIP-Large uses 768, and that was the one shown in the demo here. So this scales very well. I think it would be pretty rare that you would get up to 10 million entries on an embedded device like Jetson, but hey, if you're doing lots of data aggregation and have 30 HD camera streams coming in, it's entirely possible. And it's still only on the order of a fifth of a second or less to do all that, which is completely reasonable.

All right, so to tie it all together, here's how I actually made these demos with the text-to-speech. Riva is an awesome SDK that's openly available from NVIDIA and incorporates state-of-the-art audio transformers that we've trained, along with TensorRT acceleration. And it's completely streaming; it does streaming ASR and TTS. You can do 18 ASR streams in real time, or 57 TTS streams in real time, on Jetson AGX Orin, and nobody's really going to do that many streams at the edge unless you have multi-microphone devices or a setup like that. But what it does mean is that when you're only doing one stream, you're going to take less than 10% of the GPU to run all that, which is great, because that means our LLM token generation rate is only going to go down by less than 10%. Because the LLMs, they will consume 100% of the GPU, all that you throw at them.

And Riva has lots of different ASR and TTS models that it ships with. It also has neural machine translation, and I've seen some people do really cool demos with this where you can do live translation between different languages. It turns out that a lot of the LLMs, like Llama, are trained in English. There are some LLMs out there that are multilingual, but if you're working with an LLM that's trained in English and you want to be able to converse in other languages, you can use neural machine translation in your pipeline to translate between the ASR, the LLM, and then the LLM output back into the language that you want before it goes to the TTS. Another really cool thing that Riva has is what are called SSML expressions for TTS. So you can speed up or slow down, change the pitch, or add in things like emojis or laughs, all types of cool things to make the voice sound more realistic. And overall it sounds really good for being done entirely on an edge device, locally; a rough sketch of driving Riva TTS with SSML follows below.
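For reference, here's roughly what driving Riva TTS with an SSML prosody expression looks like from the nvidia-riva-client Python package. It assumes a Riva server is already running locally, and the exact parameter names and voice name should be checked against the Riva docs for your version.

```python
import riva.client

# Connect to a locally running Riva server (default gRPC port).
auth = riva.client.Auth(uri="localhost:50051")
tts = riva.client.SpeechSynthesisService(auth)

# SSML lets you tweak rate and pitch so the voice sounds more natural.
ssml = """<speak>
  <prosody rate="110%" pitch="+0.5st">
    Hello from a Jetson running entirely at the edge!
  </prosody>
</speak>"""

resp = tts.synthesize(
    ssml,
    voice_name="English-US.Female-1",     # assumed voice name; list the voices on your server
    language_code="en-US",
    sample_rate_hz=44100,
)

with open("out.pcm", "wb") as f:          # raw PCM samples; wrap in a WAV header to play back
    f.write(resp.audio)
```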

None of the demos that I've shown you so far rely on any cloud compute or off-board compute whatsoever. You can run these entirely without any internet connection once you download the containers or have your application built. Here is a pipeline block diagram of essentially what the interactive verbal chat management looks like. It turns out there's a lot of nuance to live chat going back and forth, mainly the ability to interrupt the LLM while it's outputting. We know these LLMs, they like to talk and they will just keep going.

You can instruct them to be very concise in their output, but in general, they like to ramble on a little bit. And it's important, like in the video, to be able to speak over them and have the LLM either resume, if it turns out you didn't want to query it, or stop itself when you ask it another question. The best way that I've found to do this is a multi-threaded, asynchronous model where there's just a bunch of queues; everything goes into these queues and gets processed, and you need the ability to interrupt and clear those queues based on things that happen. So for example, the Riva ASR outputs what's called the partial transcript as you're talking. Those are the little bubbles you see come up in the videos, always changing because it's constantly refining, via beam search, what it thinks you said. But then when you get to the end of a sentence, it produces what's called the final transcript.

And that final transcript is what actually gets submitted to the LLM. But if a partial transcript starts rolling in and you speak more than a handful of words, it will pause the LLM. If it doesn't get any more transcript within a second or two, the LLM can resume speaking. If the ASR transcript does go final, then the previous LLM response is canceled and a new one is put in its place, which is important because you don't want it to keep spending time generating an old response when you're already on to answering your next question. A stripped-down sketch of that interrupt pattern is below.
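Here's a stripped-down sketch of that interrupt pattern; the stub functions, thresholds, and toy token stream are illustrative, and the real pipeline has more queues and plugs into the Riva and LLM streams.

```python
import queue
import threading

pending_queries = queue.Queue()       # final transcripts waiting for the LLM
interrupt = threading.Event()         # set when the user starts talking over the bot

def send_to_tts(token: str):          # stub standing in for the real TTS output queue
    print(token, end=" ", flush=True)

def speak_response(token_stream):
    """Stream LLM tokens out to TTS, bailing out the moment we're interrupted."""
    for token in token_stream:
        if interrupt.is_set():
            break                     # drop the stale response mid-generation
        send_to_tts(token)

def on_asr_event(kind: str, text: str):
    """Called from the ASR thread with partial or final transcripts."""
    if kind == "partial" and len(text.split()) > 3:
        interrupt.set()               # user said more than a few words: pause the bot
    elif kind == "final":
        interrupt.set()               # cancel the old response...
        pending_queries.put(text)     # ...and queue the new query for the LLM

# Toy usage: a fake token stream gets cut off by a final transcript arriving.
threading.Thread(target=speak_response,
                 args=(iter("Jetson Orin has up to 64 GB of unified memory".split()),)).start()
on_asr_event("final", "wait, tell me about power draw instead")
```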

So it turns out there is a lot of nuance in here, and we tried to do it with as few heuristics as possible, because those just lend themselves to corner cases. In general, it's really nice and fun to be able to chat back and forth with these models. I really highly encourage you to go to the Jetson AI Lab, download these containers, start playing around with the different models, discover their personalities, and go from there and build your own applications. And I think before long, we'll see these out there on robots and in real-world embedded systems. So yeah, let's all do it together.

If you need any help or support at any time, I'm always available on GitHub, the forums, LinkedIn, or my email, dustinf@nvidia.com, and we're all out there to help each other and keep this going. So yeah, it's great to be part of this community, and thanks again so much for joining us today. At this point, we're going to get set up for Q&A, so if you haven't already, please feel free to type in your questions. If you are watching a replay of this, feel free to ask questions on the forums or GitHub or anywhere.

All right. Thanks.
