Deep Dive into Gemma 3

Hi everyone, my name is Cassidy. Today we've launched Gemma 3, the latest addition to our family of open models. These are our most capable, portable, and responsible models yet. Last week we ran an early preview of Gemma 3 27B IT on the LMSYS Chatbot Arena, and we are excited to announce that Gemma 3 27B IT ranks in the top 10 models on LMSYS. It significantly outperforms models over 15 times its size, truly setting a new frontier for open, on-device, small models. It's an honor to stand in front of you all today, introducing Gemma 3 and sharing the exciting performance achievements we've worked on for the last several months.

The Gemma 3 family of models is designed for the everyday developer; it should be easily accessible to each and every one of you. The family comes in four sizes. Our 1B is a lightweight text model, ideal for small applications. Our 4B balances flexibility and performance, with additional multimodal support. Our 12B has strong language capabilities and is designed for more complex tasks. And last but not least, our 27B is our most sophisticated model yet, offering significantly expanded capacity for handling intricate tasks and an elevated level of performance for demanding applications. For each of these models we have released both pre-trained and post-trained checkpoints, and for each checkpoint we have released four quantized versions: bfloat16, float8, int4, and Q4_0.

Gemma 3 1B is the smallest model of the Gemma family. It is geared towards on-device use cases, supported by a memory footprint of less than 700 MB when quantized in Q4_0. It is a text-only model that specializes in English and multilingual performance while still having impressive STEM performance, and it has a 32K context length, a significant improvement over the Gemma 2 family's 8K.

We have several other big advancements with the Gemma 3 family, specifically four big technical improvements across our 4B, 12B, and 27B models. We have a 16-times-longer context length, now supporting 128K tokens. We have expanded multilinguality: Gemma 3 now supports over 140 languages. Gemma 3 is a multimodal model, with interleaved image and text on input, while remaining a text-only output model. Lastly, we have added support for function calling. This lets you create a natural-language interface for a programming API without needing to write any code, and lets you generate programming calls as part of an AI agent workflow; through this programming interface you can control the model's behavior and output.

It would be impossible to stand on this stage today in Paris announcing Gemma 3 without highlighting the significant advancements we've made in multilinguality. Across pre- and post-training we have more than doubled the amount of multilingual data we train on. This was made possible by using a UniMax sampling distribution when selecting the languages we train on, and it has allowed Gemma 3 to represent over 140 languages.

One of the exciting things about Gemma 3 is that it is natively multimodal: it supports interleaving text and images on input, while outputting text. This is made possible by our vision encoder, a SigLIP variant tailored to our specific needs. Later today, Aishwarya and Jeffrey will share more on the multimodal capabilities of Gemma and what you can build with them, but right now I'm excited to introduce Shreya, who will share more on the architecture of Gemma, pre-training, and how you can get started with all of this. Thank you.
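As an illustration of the function-calling workflow Cassidy mentions, here is a minimal sketch in Python. The `generate` stub and the `get_weather` tool are hypothetical placeholders, not part of any official Gemma API; the pattern is simply to describe the tool in the prompt, ask for JSON-only output, and parse the reply.

```python
import json

# Hypothetical sketch: `generate` stands in for any Gemma 3 inference
# backend (local or hosted); it is an assumption, not a real API.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in a Gemma 3 inference backend here")

TOOL_PROMPT = """You have access to this function:
get_weather(city: str) -> str  # returns the current weather for a city

If the request requires the function, reply with ONLY a JSON object:
{"name": "get_weather", "arguments": {"city": "..."}}

User request: What's the weather like in Paris?"""

def run_tool_call(model_output: str) -> str:
    call = json.loads(model_output)  # the model was asked to emit pure JSON
    if call["name"] == "get_weather":
        return f"(stub) current weather for {call['arguments']['city']}"
    raise ValueError(f"unknown function: {call['name']}")

# Typical flow: answer = run_tool_call(generate(TOOL_PROMPT))
```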
Hi, I'm Shreya, a research engineer on the Gemma team. Like the previous versions, we are releasing pre-trained checkpoints for all sizes of the Gemma 3 family. We have updated models in the three size categories carried over from Gemma 2, and for all of them we maintain or improve performance on text metrics across capabilities like code, math, reasoning, factuality, and multilinguality (the v3 improvements over v2 are shown in red in the slides). In addition, we introduce a new size, the 1B model, which is great for resource-constrained users and demonstrates excellent English and multilingual performance.

Our pre-training recipe remains similar to Gemma 2, with some updates to aid the new capabilities we are adding. We upgrade to a new tokenizer specifically aimed at improving multilingual performance. Our 4B, 12B, and 27B models are multimodal from scratch, trained with a frozen vision encoder, so the model can understand and answer questions about a variety of images, from natural photos to pictures of documents and charts.

Gemma 3 can support four-times-larger queries: Gemma 2 was trained at 8K context length, whereas Gemma 3 is trained at 32K from the start. In the final stages of training we further extend the context length, allowing the model to support up to 128K tokens. This long-context capability opens up many possibilities: we can now pass several pages' worth of documents and images to the model and perform tasks like summarization, retrieval, and question answering.

Now let's see which architecture improvements are needed for this. In Gemma 2 we had introduced interleaved attention, alternating between local and global layers. The local layers had a sliding window of length 4096, meaning every token attended to that many tokens before it. For Gemma 3 we update the attention pattern to five local layers for every global layer: starting from the first layer, we have five local layers followed by one global layer, and so on. Further, we reduce the sliding window size to 1024, and decrease it even further to 512 for our smallest 1B model. These updates greatly reduce our memory footprint at inference time by shrinking the KV cache, so the model can be used at longer context lengths without a significant increase in memory requirements.

Like the previous versions, our models use RoPE position embeddings. To improve performance at longer context lengths, we update the RoPE config: the base frequency for the global layers is increased to 1 million, while staying at 10,000 for the local layers. In the final phase of training we rescale the RoPE embeddings by a factor of eight, helping the model generalize to longer sequence lengths.
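To make the interleaving rule concrete, here is a small sketch of the per-layer configuration Shreya describes; the number of layers passed in is a placeholder, not an official per-model value.

```python
# Five local (sliding-window) layers for every global layer, starting
# with local layers; window 1024 (512 for the 1B), RoPE base frequency
# 1M on global layers and 10k on local layers, as stated in the talk.
def layer_config(num_layers: int, model_size: str = "27b") -> list[dict]:
    window = 512 if model_size == "1b" else 1024
    layers = []
    for i in range(num_layers):
        is_global = (i + 1) % 6 == 0  # every sixth layer is global
        layers.append({
            "kind": "global" if is_global else "local",
            "sliding_window": None if is_global else window,
            "rope_base": 1_000_000 if is_global else 10_000,
        })
    return layers

# e.g. layer_config(12): layers 6 and 12 are global, the rest are local
```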
To test out the model, I collected snippets from some of my favorite books, passed them to the instruction-tuned Gemma 3 27B model, and then asked it some questions. The snippets were from classic books like Little Women, Pride and Prejudice, A Tale of Two Cities, A Christmas Carol, and more. The first question was a simple one: when does the fourth snippet take place, and what is the name of the main character? The text was from A Christmas Carol, and the model correctly answers the question. Note that to figure out which snippet is the fourth, the model has to really understand what the text is saying; it cannot answer without attending to the text. Then I gave it a more nuanced question, asking for three citations from the text describing Beth's nature. The question is about Little Women; the model correctly identifies which book the question pertains to and then provides examples and their interpretations. Basically, the model knows where to look in a big chunk of text, so it can understand and answer all sorts of questions. Next up, Léonard will talk about post-training in Gemma 3.

Hi everyone, my name is Léonard. I'm a research scientist at Google DeepMind, and I work on the post-training team of Gemma. For those who don't know, this is how I like to think about the difference between pre-training and post-training: pre-training is about encompassing and compressing the whole world's information; it trains on a massive amount of data. Post-training is more about shaping the model's behavior across all capabilities; it's about picking its behavior, its personality in some sense. For Gemma 3, I can tell you we made sure this behavior is good on a really wide set of tasks. You can see here a comparison of Gemma 2 and Gemma 3 at equivalent sizes on a wide set of tasks: code, chat, math. Our new 4B outperforms the previous 2B by far, the 12B outperforms the previous 9B by far, and the new 27B outperforms the previous 27B by far. I can even tell you that our Gemma 3 4B is almost on par with Gemma 2 27B, making it a very good model at the 4B size already; and if that is the case, you can extrapolate that our 27B is extremely good too.

These benchmark numbers also translate into how users and developers actually like the model. Tris already said it and Cassidy repeated it, but it's quite an amazing number; in fact we gained one more point of Elo last night, so it's now 1339. (Applause.) To put this into context, the only open model above us on the LMSYS arena is a 671-billion-parameter mixture-of-experts, and that's with thinking enabled; with thinking disabled, we're above it as well. It's an impressive result, and I really can't wait to see where the 12B, 4B, and 1B will land. We don't know yet; we're waiting for it just like you. But the same recipe was applied, and we're pretty confident these smaller models will also offer really nice trade-offs.

To get the most out of these models, here is some advice on how to use them. Pre-trained models are pretty easy: they do next-token prediction. You start with the beginning-of-sequence token and stop sampling when you reach the end-of-sequence token, quite classical. With instruction-tuned models there's a bit more work: you need to use the correct formatting sequence (everything is in the tech report). As a reminder, this is how a discussion is formatted: don't forget the beginning-of-sequence token; then you have turn markers between the user turn and the model turn; and you stop sampling when you hit the end-of-turn token, not the end-of-sequence token. So there's a slight difference, and whether you're sampling from the model or fine-tuning it, you have to adapt depending on whether you're using a pre-trained model or an IT model.
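The turn-marker format Léonard refers to looks like this. A minimal sketch; in practice a tokenizer's chat template usually builds this string, and the `<bos>` token is typically added by the tokenizer rather than written literally.

```python
# Gemma IT chat format: <bos>, then turns wrapped in turn markers.
# When sampling, stop on <end_of_turn>, not the end-of-sequence token.
def format_chat(turns: list[tuple[str, str]]) -> str:
    prompt = "<bos>"
    for role, text in turns:  # role is "user" or "model"
        prompt += f"<start_of_turn>{role}\n{text}<end_of_turn>\n"
    prompt += "<start_of_turn>model\n"  # cue the model's reply
    return prompt

print(format_chat([("user", "Write a haiku about Paris.")]))
```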
All the details are in the technical report, so please refer to it when trying to use these models. Maybe one last thing on prompting itself: we really tuned these instruction-tuned models to be good on a wide variety of tasks. We looked at tons of metrics, and the team spent hours vibe-checking the models on a very large set of abilities. So be specific when you want something specific: if you only want Python code as output, say "only output Python code"; if you want only yes or no, ask it to answer the following questions with yes or no. Be as specific as possible, and you'll see that Gemma 3 is much stronger than Gemma 2 at instruction following overall, be it for respecting JSON formatting or whatever else you developers want to use Gemma for. If you're interested in system instructions, there's no special formatting for them: just put them in the first user prompt, and we made sure that Gemma will follow the system instructions throughout the turns of the discussion. And as has been mentioned before, Gemma 3 has a much longer context length than Gemma 2, and you should make the most of it: in post-training we also made sure that Gemma is really good across a very long series of turns, so don't hesitate to rely on the multi-turn abilities of Gemma 3; it's really good at that. Thank you very much, and I leave the stage to Aishwarya.
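Since there is no dedicated system role, a system instruction is simply prepended to the first user turn, along the lines of this sketch:

```python
# System instructions go at the start of the first user turn; Gemma is
# post-trained to keep following them across later turns.
def first_turn_with_system(system: str, user: str) -> str:
    return (
        "<bos><start_of_turn>user\n"
        f"{system}\n\n{user}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(first_turn_with_system(
    "Only output Python code.",
    "Write a function that reverses a string.",
))
```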
Hi everyone, I'm Aishwarya Kamath, a research scientist on the Gemma team. Previous versions of Gemma, like Gemma 1 and Gemma 2, were text-only, so as some of you may know, the PaliGemma team worked on patching in multimodal support after the fact. Gemma 3, which we just launched today, expands its capabilities to include vision understanding from the start, while simultaneously preserving or enhancing performance across a broad range of text skills such as code generation, factuality, reasoning, math, and multilingual processing.

To start, I'd like to highlight some differences between the previous generation of multimodal models built using Gemma and the new natively multimodal Gemma 3. First, PaliGemma provides a separate model for each of three input resolutions; Gemma 3, on the other hand, is one model that supports any resolution. PaliGemma models are geared towards task transfer through fine-tuning on specific examples; Gemma 3 is great at zero-shot performance and can be used out of the box on a wider variety of tasks without fine-tuning on task-specific examples. PaliGemma is also not instruction-tuned, while the Gemma 3 models are, which means the model better understands user intent, leading to responses that are more accurate and relevant. PaliGemma also does not support fully interleaved image and text, while Gemma 3 does; this enables more natural interactions through multimodal chat in everyday language, making the experience more user-friendly. In case you are interested in fine-tuning on downstream tasks, the Gemma 3 pre-trained checkpoints are a great base: the 4B model is 10 times cheaper to transfer while outperforming the larger PaliGemma models on high-resolution tasks, thanks to the more efficient image encoding used in Gemma 3. At this point I'd just like to remind you that all of these expanded capabilities do not come at the expense of text performance; you've seen the Elo scores already, so it's pretty cool that we can achieve all of this extra capability without hurting text.

Now that we've seen what the model can do, let's take a look at how we do it. Here's an example of how multimodal inputs are processed by Gemma 3. In this example we have an image of a sea otter out shopping in a mall, and a prompt asking to describe the image. The image is square-resized to 896 by 896 resolution, and our vision encoder, a SigLIP variant, processes the image to produce visual tokens. These are then average-pooled down to 256 tokens per image. The image tokens are flattened and passed to Gemma 3 along with the text prompt to produce the output in an autoregressive manner. We follow up by asking Gemma to read the calculator held up in the image, and Gemma, with its excellent OCR capabilities, provides us with the answer. Here I've illustrated the attention pattern used in Gemma 3 to support interleaved image and text: attention is fully causal on text and bidirectional on images.

We evaluate Gemma 3 on a wide range of domains, and as you can see here, Gemma 3 achieves great performance on reasoning, math, document understanding, and video understanding, often outperforming or matching models that are much larger. While square resizing suffices for most images seen in the wild, it severely limits readability for images with skewed aspect ratios, as you can see here. We use the following approach, called pan and scan, to process such images: we pan over the image and extract crops, and each crop is then resized to the standard input resolution and fed to the model. The minimum crop size and the number of crops are configurable and decide how much the model can zoom into the image. With this approach we are able to train at a single resolution and evaluate at any resolution. While Gemma 3 already has impressive performance on many image-understanding benchmarks, pan and scan allows us to handle images with varying aspect ratios effectively, providing boosts on document-related tasks and tasks that involve reading text and reasoning about images. Gemma 3's multimodal chat capabilities, combined with its multilingual proficiency, better reasoning and math abilities, and 128K context length, pave the way for a multitude of diverse and innovative applications. Jeffrey, who will be presenting next, will go more in depth with a few examples bringing all of these capabilities together. Thank you.
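Here is a rough sketch of the pan-and-scan idea: split a skewed image into crops and resize each crop to the native square resolution. The crop-selection heuristic below is an illustrative assumption, not the exact Gemma 3 algorithm.

```python
from PIL import Image

NATIVE = 896  # Gemma 3's square input resolution

def pan_and_scan(img: Image.Image, max_crops: int = 4) -> list[Image.Image]:
    w, h = img.size
    # Tile along the longer axis; more skew means more crops (assumed rule).
    n = min(max_crops, max(1, round(max(w, h) / min(w, h))))
    crops = []
    for i in range(n):
        if w >= h:  # wide image: pan left to right
            box = (i * w // n, 0, (i + 1) * w // n, h)
        else:       # tall image: pan top to bottom
            box = (0, i * h // n, w, (i + 1) * h // n)
        crops.append(img.crop(box).resize((NATIVE, NATIVE)))
    return crops
```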
Hi everyone, I'm Jeffrey, a research engineer at Google DeepMind, where I work on the post-training of Gemma. For this new generation of Gemma models, we added image prompts to our instruction-tuning recipe in order to make Gemma 3 IT a really easy-to-use, multimodal, generalist model. We can now have back-and-forth conversations with Gemma using images, so let's dive into a few examples to explore what we can do.

The first is a reasoning example. We drew a rectangle on a whiteboard, added some measurements and a question mark, and asked Gemma what the question mark in the image is. Gemma understood our hidden intent, which was to compute the value of the diagonal, and proceeded by computing it. Then we have a cool use case where we want to go from images to LaTeX formulas: we took a screenshot of a math equation and asked Gemma to transcribe it into a LaTeX formula. This was easy work for Gemma, which output the LaTeX formula plus a nice explanation of the math. Then we wanted to try multi-turn use and multilinguality, so we took a picture of a bus ticket from Prague and asked Gemma how much the ticket costs. Gemma understood that the ticket cost 870 Czech koruna. Then we asked another question: how long is this ticket valid? Gemma noticed that the top left corner of the bus ticket has a clock sign with "24 h" next to it, and guessed that this sign means the ticket is valid for 24 hours, which was indeed the case.

For our last example, we have a simple plot with two functions, and we asked Gemma a bunch of questions. The first was pretty simple: what is g? Gemma understood that g is the horizontal line at y = 1. Then a trickier question: what about f? Here you have to make some guesses by looking at how the curve behaves, and Gemma made the right guess that f is ln(x + 1) − 1, then gave some explanation of why it chose this approximation. Then we asked at which point these functions cross. Gemma relied on past context to understand what we were talking about and started solving the equation needed to answer the question. Then, perhaps a trickier one for it: what about the derivative of f? Gemma had already worked out f, so it proceeded to compute the derivative without needing to re-derive what f is. And one last thing: we wanted to make sure Gemma didn't forget about the image we gave at the very beginning of the exchange, so we asked what the color of g is. That was easy work for Gemma as well, and it found the right answer: orange. These four examples are just a tiny sample of what we can do with Gemma. We are really excited to release Gemma to the public and see all the cool use cases you will come up with. Thanks, and now I'd like to welcome Robert.
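A multi-turn, image-grounded exchange like the ones above can be expressed with the Hugging Face transformers chat-message convention, roughly as follows; the exact processor behavior should be checked against the official Gemma 3 release, so treat this as a sketch.

```python
# Sketch of a multimodal, multi-turn conversation (message structure only).
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "plot.png"},  # local image path
        {"type": "text", "text": "What is g in this plot?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "g is the horizontal line y = 1."},
    ]},
    {"role": "user", "content": [
        # Later turns can refer back to the image from the first turn.
        {"type": "text", "text": "What color is g?"},
    ]},
]
# Assumed usage with a Gemma 3 processor (check the official docs):
# inputs = processor.apply_chat_template(messages, add_generation_prompt=True,
#                                        tokenize=True, return_tensors="pt")
```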

Hey everyone, my name is Robert Dadashi, a research scientist on the Gemma team. Google's mission is to organize the world's information and make it universally accessible and useful, and the Gemma models are completely aligned with this mission: they're small, they can run on device, their weights are open, and they're extremely capable. It's still our responsibility to make sure they're widely accessible by making them really good at multilinguality. Gemma 1 and Gemma 2 already had pretty strong multilingual abilities, but we still decided to literally double down on the weight of multilingual data for both pre-training and post-training. We also updated the tokenizer, because a large language model can only be as good as its underlying tokenizer, and during development we set ourselves a high bar on multilingual evals.

Speaking of post-training evals, they are usually based on side-by-side evaluations: you take a set of prompts in a specific language, generate one answer from Gemma and one answer from a baseline model, and then ask a usually bigger and more capable model to give a preference score between the two answers. The score typically goes from −1.5 to +1.5, where a negative number means the judge prefers the baseline and a positive number means it prefers Gemma. Here are some numbers comparing Gemma 2 27B to GPT-4o and Gemma 3 27B to GPT-4o. To understand what these numbers mean, we usually consider +0.05 a significant improvement, and we see that for German we go all the way to +0.25 compared to GPT-4o, in Spanish +0.22, and in Japanese +0.03. In fact, one minute before this talk I learned that on the LMSYS Chatbot Arena, the Gemma 3 27B model is now leading all open-weights models in French and in Spanish. (Applause.)

Now, we're in France, so we speak French. Here's an example that I really like; it's personal. The prompt, in French, translates to: "I was born in the '90s in France and I'm a football fan. In your opinion, who is my childhood hero? Answer by giving his nickname and his number, with a hyphen between the two." Obviously the answer is "Zizou - 10". The reason I really like this example is that, first of all, the model needs to understand French. Then it needs to have some cultural context: it needs to know what happened in the '90s in France that has a link with football, the 1998 World Cup; it needs to know who the most popular player on the team was, Zinedine Zidane; it needs to know his number, number 10; and then it needs to understand the formatting requirement of the prompt, which is to output his nickname and his number with a hyphen between them.

We know that Gemma 3 is pretty great, but it's really our mission to make it even better at multilingual abilities. It's one thing to serve 140 languages, but there's still so much more we can do to improve Gemma. Now I'd like to thank you all for attending; on behalf of my teammates, we're so excited to see what amazing things you will build with this new generation of Gemma models.
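As a closing illustration, here is a toy sketch of the side-by-side preference scoring Robert describes; the judge call is a placeholder assumption, not a real evaluation harness.

```python
from statistics import mean

def judge(prompt: str, gemma_answer: str, baseline_answer: str) -> float:
    """Stand-in for a larger judge model; returns a score in [-1.5, +1.5],
    positive when Gemma's answer is preferred over the baseline's."""
    raise NotImplementedError("plug in a judge model here")

def side_by_side(prompts, gemma_answers, baseline_answers) -> float:
    scores = [judge(p, g, b)
              for p, g, b in zip(prompts, gemma_answers, baseline_answers)]
    return mean(scores)  # per the talk, |score| >= 0.05 is significant
```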
