Fine Tune DeepSeek R1 | Build a Medical Chatbot



So let's talk about DeepSeek. If you've clicked on this video, you're probably already aware of what DeepSeek is, so I won't bore you with a long-winded explanation. But if you want more information, I highly recommend that you check out this blog, which I linked in the description below. In a nutshell, though, DeepSeek R1 is a large language model developed by the Chinese company DeepSeek, which rivals OpenAI's o1 in reasoning performance, but at a much cheaper price. I'm talking much cheaper: DeepSeek claims it only cost about $5 million to train, and it's open source. So overnight we went from having one strong reasoning model that is closed source and pretty expensive to thousands, if you count the fine-tuned versions of DeepSeek already on Hugging Face, that are open source and much cheaper.

This, and the premise that you do not need large amounts of compute and GPUs to train a reasoning-level model, wiped about $1 trillion off the US stock market. Now, setting aside short-sighted investor sentiment, I still think we're pretty early in this DeepSeek saga. I've seen folks in the community claim that DeepSeek is lying and that they have access to a lot more compute than they say. I've seen others argue that DeepSeek is pretty legit, and I've even seen news of OpenAI claiming that DeepSeek used something called knowledge distillation to learn from the outputs of OpenAI's models and make DeepSeek's models smarter. So we're still pretty early in the story here.

We still don't really know what's going on, so I'm not going to pretend that I have a strong take here. What I do know, though, is that we have access to a pretty cheap, strong reasoning model that we can fine-tune right now, which is what we're going to be doing today.

Now, my focus throughout the video will be to give you as much intuition as possible over the fine-tuning process. So if you have a basic understanding of Python, machine learning, and a bit of knowledge of deep learning, I hope that this should be accessible enough for you. Success in this video is that you walk away, not necessarily understanding the math that's going on behind the scenes, but with a basic understanding, in plain English, of the intuition behind the fine-tuning steps that we're doing today.

Just a quick one: today's video walkthrough is a fork of a written tutorial on the same topic by Abid Ali Awan on the DataCamp website. Without him, this video would not be possible, so make sure to give him a shout-out and check out his blog below. Now, with that said, let's get started.

So I have a slide here where I'm going to walk you through a bit of the intuition of what we're doing today. DeepSeek R1 is essentially a large language model, and today we're going to be fine-tuning an 8-billion-parameter distilled version of it. This is a really rudimentary representation of what DeepSeek probably looks like behind the scenes: essentially a large language model that has many weights, takes a prompt as input, and returns an output. What we're going to be doing today is fine-tuning it on a medical dataset so that its outputs are much more consistent, in the style that we want them to be, and more accurate on medical reasoning data.

We're going to be fine-tuning DeepSeek on the Medical-o1 reasoning dataset. We're going to take a look at the dataset shortly, so don't worry about that right now. And we're going to be using a technique called LoRA, short for low-rank adaptation.

So we'll use LoRA to be able to fine-tune DeepSeek in an efficient manner. Now, I'm going to give you a bit of intuition for LoRA. I'm not going to go into the math behind it; instead, I'll use an analogy. You know, I'm a gamer. I love playing PS5 games. And this is a factory

that Sony owns, that creates PlayStation 5s. As you can see, it's really well optimized for creating PS5s. On the right-hand side, I want you to imagine that we're looking at the factory from a bird's-eye view, and each small circle or square here is a node optimized for doing one part of the supply chain: here we attach the GPUs, here we attach the CPUs, here we attach the ports, so on and so forth.

And this is a factory that is really optimized for building PlayStation 5s. But lo and behold, a few months ago Sony announced the PlayStation 5 Pro, and now Sony has a problem on its hands: it needs to optimize its factories to also create PS5 Pros.

So you have two options here. The first option is to completely change the factories so that they're only optimized for building PlayStation 5 Pros. Now, the con here is that this is really expensive.

It takes a long time, and we still want to be able to make PS5s. So changing an entire factory from creating product A to product B is quite expensive and time-consuming. The pro, though, is that by the time this process is done, the factory will be extremely good at creating PlayStation 5 Pros. Now, the other way you can do this is to change only the part of the factory that really matters, so that, because the PS5 Pro has really similar components to the PS5, the factory becomes more optimized for building PlayStation 5 Pros.

Now, the con here is that this factory may not be as good at making PS5 Pros as the first one; it may not be as efficient or fast at making them. But it's pretty cheap to change the logistics, it does not take a long time, and the factory is still able to make PS5s. In a lot of ways, this is what low-rank adaptation does. It does not fine-tune all of the weights in a deep learning model; instead, it fine-tunes some of the weights and layers that matter in order to improve the performance of a large language model on the particular data included in the fine-tuning dataset. Now, I linked here, and we're going to link this in the description, a technical explanation of low-rank adaptation by Sebastian Raschka, who is really one of the best NLP researchers in the game.

I highly recommend that you check out his content there. But this is the intuition of LoRA, and I won't really be going into the math behind LoRA today in this video.
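If you do want a one-line picture of the math anyway, here is a hedged sketch (we won't need it for the rest of the walkthrough): LoRA keeps the pretrained weight matrix W frozen and learns a small low-rank update on top of it,

W' = W + (α / r) · B·A,  with W ∈ R^{d×k}, B ∈ R^{d×r}, A ∈ R^{r×k}, and r ≪ min(d, k),

so only A and B, a tiny fraction of the total parameters, actually get trained during fine-tuning.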

Now, the packages that we're going to be using today are: Unsloth, which is a module that allows for efficient fine-tuning and inference of large language models; of course Hugging Face, because of all of its helper functions to download the fine-tuning dataset, download the model, and so forth; and Weights & Biases for tracking our experiment and seeing how it performs.

So with that, let's get started. Now, we're going to be using Kaggle notebooks here today, and Kaggle notebooks let you use GPUs for free. I think you get around 30 hours a month.

So we're going to be using it today. And to set up your GPUs, all you need to do is go to Settings, go to Accelerator, and choose GPU T4 x2. All right.

So again, as I mentioned, here are the tools and packages we're going to be using today. We're going to be using Unsloth, which is for efficient fine-tuning and inference. Specifically, we're going to be using these two modules: FastLanguageModel and get_peft_model.

I'll explain these in a bit; what they allow us to do is adapt the model for low-rank adaptation. And we're going to be using a few different Hugging Face modules: Transformers, to work with the fine-tuning data and handle different model tasks, and TRL (Transformer Reinforcement Learning) from Hugging Face, which handles the supervised fine-tuning process.

And then Datasets lets us fetch the reasoning dataset from Hugging Face. We're also going to be using PyTorch for a few helper tasks, as well as Weights & Biases for tracking the experimentation. Now, before we get started, we need to have access to the Hugging Face and Weights & Biases APIs. So again, as I mentioned, first set your GPU accelerator by going to Accelerator and choosing GPU T4 x2. And then, to get a Weights & Biases API key, all you need to do is go to the Weights & Biases website and sign up.

If you haven't, then go to your settings. So I'll go here, then all we need to do is go to Settings, go to API keys, and reveal the API key; you'll be asked to sign in again, and then

you'll have access to the API keys. So going back here now, in order to get your Hugging Face token, all you need to do is go to Hugging Face as well, sign up if you haven't, then go to the profile icon that you see on the top right-hand side, press Access Tokens, and you're good to go. I have my access token here already created; all you need to do is copy-paste it.

And then, in the Kaggle notebook interface, you need to go to Add-ons, go to Secrets, and add your Hugging Face token and your Weights & Biases token. So all you need to do here is add a secret; this is the name that you give it.

So here I'm using HF_TOKEN for Hugging Face and a similar name for the Weights & Biases key, and then you're good to go. Now, I'm also going to begin the process here by installing a few relevant packages.

So I'm going to install Unsloth, the latest version of Unsloth here, and I'm going to import all of my relevant packages for this walkthrough. So again, from Unsloth I'm going to import FastLanguageModel. Don't dwell on the packages and modules that we're using; just read these as tools. It's fine if you're not an expert in any of them.

We're going to be learning along the way. We're going to import PyTorch, and from TRL (Transformer Reinforcement Learning) we're going to import SFTTrainer, which is the supervised fine-tuning trainer. From Unsloth we import is_bfloat16_supported; this just checks if the hardware we're using supports the bfloat16 precision data type.

Again, don't dwell on it; it's just a data-type technicality. Then, from huggingface_hub we import login.

This lets you log into the API. From transformers we import TrainingArguments; this lets us define the training hyperparameters for the fine-tuning process. Again, we'll check this out later, so don't worry about it. From datasets we import load_dataset.

This lets us load the fine-tuning dataset that we're going to use. Then we're going to import Weights & Biases, and we're going to import UserSecretsClient from kaggle_secrets. So I already ran this code here.
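For reference, that install-and-import cell looks roughly like this (a sketch; exact package versions may differ from what you see on screen):

```python
# Install Unsloth inside the Kaggle notebook (run once)
# !pip install unsloth

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch                                     # PyTorch, for a few helper tasks
from trl import SFTTrainer                       # supervised fine-tuning trainer
from huggingface_hub import login                # log into the Hugging Face API
from transformers import TrainingArguments       # training hyperparameters
from datasets import load_dataset                # fetch the fine-tuning dataset
import wandb                                     # Weights & Biases experiment tracking
from kaggle_secrets import UserSecretsClient     # read tokens stored in Kaggle Secrets
```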

And you can see here that Unsloth has been imported successfully. Now, the next thing is to fetch the API keys and log into Hugging Face and Weights & Biases. So we have this code here that you can see: from kaggle_secrets import UserSecretsClient.

And then we define the user_secrets variable: we initialize it using the UserSecretsClient class. Again, this comes from kaggle_secrets. Then we fetch the Hugging Face token with get_secret.

Again, we're using the same names as the tokens we set up earlier in Secrets. So that's all we need to do.

And then we log into Weights & Biases as well. So I log in with the Hugging Face token here, and we're logged into Hugging Face. Then, for Weights & Biases, I log in, and then I initialize a new project called

"Fine-tune DeepSeek R1 Distill Llama 8B on Medical Chain of Thought Dataset - YouTube walkthrough", or a similar short name, with job_type set to "training", since we're doing a fine-tuning run, and anonymous set to "allow". That essentially lets us log in here to Weights & Biases and Hugging Face. Now, I've already logged in here just to avoid running the code too many times.
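That cell looks roughly like the sketch below; the secret names (HF_TOKEN, WANDB_TOKEN) and the project name are placeholders for whatever you chose in Kaggle Secrets and Weights & Biases:

```python
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login
import wandb

user_secrets = UserSecretsClient()

# Secret names must match what you saved under Add-ons -> Secrets
hf_token = user_secrets.get_secret("HF_TOKEN")
wb_token = user_secrets.get_secret("WANDB_TOKEN")

login(hf_token)            # authenticate with Hugging Face
wandb.login(key=wb_token)  # authenticate with Weights & Biases

run = wandb.init(
    project="Fine-tune-DeepSeek-R1-Distill-Llama-8B on Medical COT Dataset",
    job_type="training",
    anonymous="allow",
)
```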

So with that, let's get started. The first step is to load DeepSeek and the tokenizer. We're going to use the from_pretrained function from Unsloth's FastLanguageModel module, and we're also going to configure a few key parameters for inference and fine-tuning.

So let me explain what we're doing here. We're going to set the max sequence length that the model can handle, i.e., the number of input tokens it can take, to 2048. This is fairly standard.

We're going to set the data type to None, so dtype=None, because then the model will be able to auto-detect the data type. And then, this is really important, we're going to set load_in_4bit=True.

This enables 4-bit quantization, which is a memory-saving optimization technique. Now, I want to give a bit of intuition on 4-bit quantization. You know how you can compress an image from, say, 30 MB down to 4 MB without necessarily losing a lot of the quality? That's what we're doing here: we take the model weights and, instead of storing them as 32-bit or 16-bit numbers, we compress them into 4-bit values.

This allows large language models to run more efficiently on consumer-grade GPUs, without needing massive amounts of memory. And that's essentially what 4-bit quantization is doing here.

So what I'm going to do is first define the model and the tokenizer. As you can see here, the solution code will be available to you, so you don't necessarily need to code along.

But if you want to get the intuition, you can follow here. I'm going to write FastLanguageModel.from_pretrained and open it up with model_name; I already have this in my notes.

It's unsloth/DeepSeek-R1-Distill-Llama-8B (make sure the B is capital). This is the model that we're going to be using today: a distilled version of DeepSeek R1 that has 8 billion parameters, so a smaller version of DeepSeek R1. I'm going to set my max_seq_length to that same parameter that you see here.

dtype equals dtype, so the data type is set to the default; I'm setting it to None.

Do I load in 4-bit? Yes I do, so it's True; I enable 4-bit quantization here. So I'm going to take it here.

And what is the token? I'm downloading this from Hugging Face, so this is a wrapper around the Hugging Face function that lets us load and run inference on the model using Unsloth at a much faster pace.

It's much more efficient. And what I'm going to pass here is my Hugging Face token. And then I'm going to run it; let's see if it works.
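Put together, the loading cell looks roughly like this (a sketch; the variable names are just the ones used in this walkthrough):

```python
max_seq_length = 2048  # maximum number of input tokens the model will handle

# FastLanguageModel was imported from unsloth in the setup cell above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Llama-8B",  # distilled 8B version of DeepSeek R1
    max_seq_length=max_seq_length,
    dtype=None,          # let Unsloth auto-detect the best precision
    load_in_4bit=True,   # 4-bit quantization so it fits on consumer-grade GPUs
    token=hf_token,      # Hugging Face token from Kaggle Secrets
)
```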

Okay, so I think, yeah, we're downloading the model. It's going to take a few minutes. So we're good to go; I think the tokenizer is there.

The model is there. Now what we're going to do is test the R1 model, without any fine-tuning, on a medical use case. So first we're going to define a system prompt, because this includes placeholders for the instructions, the question we're asking, and the response, and we're going to guide the model to think step by step. Now, if you see this <think> tag here, this is the chain of thought.

What makes DeepSeek R1 and o1 special is that these models have a chain of thought: they think before they perform tasks. So we're going to see the chain of thought in action.

So I'm going to define a system prompt here, and this prompt will guide the model to think step by step and provide a logical, accurate answer. And then we're going to run inference on the model: we're going to give it a medical question, and it's going to give us a response.
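The template looks something like the sketch below; the exact wording is illustrative rather than the verbatim prompt from the notebook, but the structure (instruction, question, then a response section that opens a <think> tag) is the part that matters:

```python
# Inference prompt used before fine-tuning: one placeholder for the question,
# plus a response section that opens the model's chain of thought with a <think> tag.
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""
```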

So first we're going to define the question, as you can see here: a 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing, but no leakage at night, undergoes a gynecological exam and a Q-tip test; based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions? Now, I cannot answer that question myself, so don't expect me to know the answer.

But nevertheless, we're just using this for illustrative purposes. Now, let me first explain the process: we're going to define a test question, then format the question using the structured prompt style that we discussed, then tokenize the inputs and move them to the GPU for faster inference.

Then we generate a response using the model and decode the output tokens back into text, to obtain a final, readable answer.

So going here, first I'm going to enable optimized inference for Unsloth. I'm going to call FastLanguageModel.for_inference and pass in the model here. For inference, I think Unsloth is about two times faster than plain Hugging Face.

And we already defined the model, so we can use it here. That's step one. Then, step two, I'm going to format the question using the prompt style that we have, so that it takes the shape the model expects.

So this is where I use the tokenizer that we created: inputs equals the tokenizer applied to prompt_style.format with the question. So here what I'm doing is just formatting the question according to the prompt style, and then I'm going to make sure that we return PyTorch tensors.

So we convert the input to PyTorch tensors here. And then what I'm going to do is call .to("cuda") so that we move this inference to the GPU.

Now I'm going to generate a response. So the output is model.generate.

So what is the input that we're using? It's the inputs that we defined above, so input_ids.

First let's define that: input_ids equals inputs.input_ids. So these are the tokenized input questions. Then

attention_mask equals inputs.attention_mask. This just helps the model handle things like padding or empty positions, so don't worry too much about it; it's just one of the parameters we need to pass. max_new_tokens equals 1200, so this limits the response to 1200 new tokens.

And then I'm going to set use_cache=True, which enables caching for faster inference.

So if we use the same question again, it will be able to respond much more quickly. Now we're going to decode the response: response equals tokenizer.

batch_decode of the outputs, okay. And then what I'm going to do here is print the response. So again, what we're doing here is decoding the output tokens back into text using the tokenizer. And if I look at it here, the response will essentially be a key-value pair.

So if I just print the response... oops. What is the error that we have? It's return_tensors; I think I mistyped return_tensors.

Okay, so if I print the response, it will be a dictionary. Okay.

So, it's actually a list, sorry about that. And you can see that we have quite a lot of text that's pretty hard to read. So what I need to do is first get the element inside the list, and then I'm going to split it on the response marker. So if you go here, you have the question, right?

And here is the instruction, and then if I go down you see a think tag, whereas the response should be somewhere here. Yeah, it's over here, sorry about that.

So the response is here. And then I'm going to run this again; it should run more quickly now since we set use_cache=True. So let's see how it performs.

Oh, I made a mistake here: I forgot to index into the result. I should have added this part here. So, so that we don't run inference again, I'm just going to run this bit here. And yeah, you get this because we split on the response marker.

Essentially, all we need to get is the second element of the list after the split, which is the actual response itself, since we split on the response marker that you can see here.
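So the whole pre-fine-tuning inference cell ends up looking roughly like this (a sketch; I'm assuming the prompt template ends with a "### Response:" marker, which is what we split on):

```python
# Switch Unsloth into its optimized inference mode
FastLanguageModel.for_inference(model)

question = ("A 61-year-old woman with a long history of involuntary urine loss during activities like "
            "coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. "
            "Based on these findings, what would cystometry most likely reveal about her residual volume "
            "and detrusor contractions?")

# Format the question with the prompt template, tokenize, and move the tensors to the GPU
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,   # cap the length of the generated answer
    use_cache=True,        # reuse past key/values for faster generation
)

# Decode back to text and keep only what comes after the "### Response:" marker
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
```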

Cool. So now we have the response, right? And you can see the chain of thought: "Okay, I'm trying to figure out what the cystometry would show."

It talks through the leakage and the involuntary urine loss, yada yada yada. DeepSeek is very verbose, as you can see, when it's in its chain of thought. And then it provides an answer here; this is the most likely reply:

increased residual volume in the bladder due to incomplete emptying, so on and so forth. So we now need to fine-tune DeepSeek.

So why are we even doing that? It output a successful chain of thought and it provided reasoning before the final answer. But if you look at the reasoning process here, one, it's really long-winded, so we want it to be a bit more concise.

And more importantly, we want the final answer to be consistent in a certain style. So with that, let's do a bit of work here to start fine-tuning step by step. Let's get started.

If I look here at step one, which is updating the system prompt: if you notice, before, we didn't have the ending </think> tag that we have here. So we're going to first add the ending </think> tag, and we're also going to add a third placeholder to the training prompt for the chain-of-thought column of the fine-tuning dataset.

So all we need to do here is initialize this train_prompt_style variable, and it will be much clearer once you take a look at the training dataset. Now, I'm also going to download the fine-tuning dataset that we're going to be using. So if you look here, this dataset that we'll be using today, from Hugging Face, essentially has three columns: a question, the chain of thought associated with the question, and the response.

You can see it here, and this particular dataset was used to fine-tune a model called HuatuoGPT-o1, which is a medical LLM designed for advanced medical reasoning. If you want to use it, cite the authors that you can see here. Now, the dataset was constructed using GPT-4, but it has been verified for medical accuracy.

Now, what we're going to do here is first download the dataset. So I'm going to create dataset = load_dataset. The name of the dataset, which I'm just going to copy-paste here, is FreedomIntelligence/medical-o1-reasoning-SFT.

We're going to specify the language as English, so "en", and I'm going to set split to "train[0:500]".

What this does is let me get just the first 500 rows of this dataset. And I'm going to add this parameter trust_remote_code=True, which lets me download the dataset remotely. And then we run the cell and look at the dataset.
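That loading cell looks roughly like this (a sketch; the slice of 500 rows is just to keep the fine-tune quick):

```python
# Load the first 500 English examples of the medical reasoning dataset
dataset = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT",
    "en",                   # English configuration of the dataset
    split="train[0:500]",   # only the first 500 rows, to keep training fast
    trust_remote_code=True,
)
print(dataset)       # features: Question, Complex_CoT, Response
print(dataset[1])    # inspect a single example
```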

As you can see, we're downloading it and generating the train split, and we essentially get this dataset dictionary with the features question, complex chain of thought, and response, so on and so forth. Now, if I want to see the second entry of this dataset, I index it with dataset[1]. So the question here is about a 45-year-old man with a history of alcohol use who has been abstinent for the past ten years, so on and so forth.

Then the complex chain of thought, and then the response that you see below.

So the next step is that I want to structure the fine-tuning dataset according to the train prompt style that we defined here: instruction, question, chain of thought, and then response. This is quite important, because each question is paired with a chain-of-thought reasoning and the final response.

It ensures that every training example the model sees follows a consistent pattern, which helps performance. And it stops the model from continuing beyond the expected response by adding an end-of-sequence token. So let me define what the end-of-sequence token is.

It's pretty simple: EOS_TOKEN. The tokenizer has this eos_token attribute.

And I'm going to define it here. It's essentially a tag that says: end of sequence, don't generate any more content.

Then I'm going to define the formatting function. What this does is reshape the dataset that you just saw according to this prompt style, so we're just doing a bit of data manipulation here. So: define a formatting prompts function.

What this function does is take as input the examples from the dataset we have. We define inputs, which is essentially the input for the model: from the examples, we take the question. And again, the question is the one you saw above, the 45-year-old patient and so on.

And then we're going to take the complex chain of thought, so Complex_CoT; I'm going to put it into a cots variable. And then the outputs.

Which, I think, come from the Response column. Okay, perfect. So here, for every single example we get a question, the complex chain of thought, and the response.

And what I'm going to do is create an empty list called texts to store all of these in one single prompt-style format. So I'm going to iterate over the dataset, format each question, reasoning step, and response according to this prompt style that you see here, and just put it into that texts list you saw here.

So I'm going to zip my inputs here: for input, chain of thought, and output in zip of inputs, chains of thought, and outputs. So I'm zipping the examples together. And then what I'm going to do is format them; as you see here, I have the different inputs, and I'm going to format them accordingly.

So I have the question placeholder here, the think placeholder here, and a response placeholder here.

I'm going to format them into this style of prompt: train_prompt_style.format of the input, the chain of thought, and the output, plus the EOS token, the end-of-sequence token that you see here.

So again, what I'm doing is: each question gets formatted here, each chain of thought gets formatted here, and each response gets formatted here.

And then I'm adding an end-of-sequence token at the end. Then I append this text to the texts list that I've defined. And then what I'll do here is return

the texts. And this really returns the... oh, I need to give it a key name here.

Right. So here, hopefully this is correct; let's test it out. I'm going to create a new dataset called dataset_finetune.

And I'm going to map the function that we just created onto the dataset, so .map of the formatting prompts function, and then I set batched equals True.

So it can process the data in batches. Then, in dataset_finetune, we should have a new "text" key, and I check the first example.
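Here's a sketch of what that formatting function and the mapping step look like; I'm assuming a train_prompt_style template with three placeholders (question, chain of thought, response), as described above, and the dataset column names Question, Complex_CoT, and Response:

```python
EOS_TOKEN = tokenizer.eos_token  # tells the model to stop generating

def formatting_prompts_func(examples):
    """Combine question, chain of thought, and response into one training string per example."""
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for question, cot, response in zip(inputs, cots, outputs):
        # Fill the three placeholders of the training prompt and close with the EOS token
        text = train_prompt_style.format(question, cot, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Apply the formatting to every example; batched=True processes examples in chunks
dataset_finetune = dataset.map(formatting_prompts_func, batched=True)
print(dataset_finetune["text"][0])
```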

And we can see here that it is correct: "Below is an instruction that describes a task, paired with an input... Write a response that appropriately completes the request," then the question, which is the one we used in the original example, then the response section with the think tags, and then the answer, ending with an end-of-sequence token. Now, what we need to do next is set up the model using low-rank adaptation.

So again, I added an intuitive explanation of low-rank adaptation here. Essentially, instead of modifying all the weights of the model, LoRA adds small, trainable adapters to specific layers of the model. So again, going back to that PlayStation example:

instead of rebuilding the entire factory to produce a new product, it adds small specialized nodes to existing machines in the supply chain. This lets it adapt quickly without disrupting the core structure. Below, we're going to be using the get_peft_model function, where PEFT stands for parameter-efficient fine-tuning. What this function does is wrap the model with the LoRA modifications.

So again, don't focus on the jargon here. What this function does is simply let us adapt the model at specific layers. So what I'm going to do is create a new model called model_lora, and I'm going to define it from FastLanguageModel. Again, if you remember, we imported FastLanguageModel from Unsloth, and I'm going to use its get_peft_model function.

The first input here is our model, so we're wrapping this model with the LoRA modifications. We're going to set r=16; this determines the size (the rank) of the trainable adapters. The higher it is, the more weights we're adapting;

the lower it is, the more efficient we're being. So I'm going to use 16 here,

after some trial and error. Then I'm going to define which layers the LoRA adapters will be applied to. Again, I won't go into the math here, but I highly recommend that you check out our Transformers content,

blogs, and courses; I link to them in the description. Because here, what we're going to be doing is defining a list of specific layers in DeepSeek's transformer architecture that we will be adapting. So here, q_proj is the query projection in the self-attention mechanism.

k_proj is the key projection in the self-attention mechanism. Again, I don't want to go through the math; these are essentially target modules from the attention mechanism, different parts of the transformer architecture that we are adapting here. v_proj is the value projection in the self-attention mechanism.

Again, I highly recommend that you check out the tutorials on these. o_proj is the output projection from the attention layer. And then gate_proj, which belongs to the feed-forward layers;

this is part of the transformer's feed-forward network, along with down_proj, another part of the transformer's feed-forward network.

And again, I'll repeat: all we're doing here is listing the transformer layers where LoRA will be applied. That is what we're doing. The solution code is much more heavily commented, and you'll be able to check it out step by step, but for these, I highly recommend that you check out a transformer architecture tutorial to get an understanding of what we're actually doing here.

The intuition, though, is that we're just adapting some parts of the factory, and we're defining what those parts are.

Now we're going to set lora_alpha=16. The higher this number is, the more weight the LoRA updates carry

in these layers. Then lora_dropout: we're going to set it to 0, which means no dropout.

Dropout randomly drops some information during the weight-updating process as a form of regularization; setting it to zero means full retention of information. Then bias equals "none".

This specifies whether the LoRA layers we're updating should learn bias terms; setting it to "none" is essentially a memory-saving choice.

And here is another Unsloth memory-saving technique: use_gradient_checkpointing set to "unsloth".

Okay, so this saves memory by recomputing activations instead of storing them, which is especially useful for fine-tuning on datasets with long contexts. I'm going to set random_state=3407; this is a random seed, for reproducibility.

And then that's essentially it. use_rslora refers to a variant of LoRA called rank-stabilized

LoRA. I'm going to set this to False; I'm not going to go into the details here, just set it to False,

because this goes beyond the scope of today's discussion on what low-rank adaptation is. And then I'm going to set one last thing:

loftq_config equals None. This relates to a fine-tuning-aware quantization technique; we've disabled it because we already have 4-bit quantization. Now let's see if this works.
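Here's roughly what that cell looks like; I'm listing the usual Llama-style attention and feed-forward projection layers as target modules (treat the exact list as the one in the solution code):

```python
model_lora = FastLanguageModel.get_peft_model(
    model,
    r=16,                            # rank of the LoRA adapters (size of the trainable update)
    target_modules=[                 # transformer layers where the adapters are attached
        "q_proj", "k_proj", "v_proj", "o_proj",   # self-attention projections
        "gate_proj", "up_proj", "down_proj",      # feed-forward network projections
    ],
    lora_alpha=16,                   # scaling of the LoRA update
    lora_dropout=0,                  # no dropout on the adapters
    bias="none",                     # don't train extra bias terms (saves memory)
    use_gradient_checkpointing="unsloth",  # recompute activations to save memory
    random_state=3407,               # seed for reproducibility
    use_rslora=False,                # rank-stabilized LoRA variant, disabled here
    loftq_config=None,               # LoftQ config, not used since we already load in 4-bit
)
```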

Okay, we have gotten it done. What we need to do now is specify the fine-tuning process. We're going to initialize the fine-tuning trainer using the SFTTrainer from Hugging Face's TRL. So if I go up here to my imports, we're going to be using this class:

SFTTrainer, from TRL (Transformer Reinforcement Learning). So with that said, let's get started. We're going to define a trainer object.

And we're going to use the supervised fine-tuning trainer. The model that we'll be using is the low-rank adapted model, so model_lora.

Then the tokenizer equals the tokenizer that we've already defined; this is the tokenizer for the text inputs. The training dataset is the dataset_finetune that we defined previously.

What is the field that we're going to be training on? It's this text field, so dataset_text_field equals "text".

What is the max sequence length? That's already defined from earlier, when we loaded the model and ran inference before fine-tuning.

So max_seq_length, which I think is 2048. What is the number of processes to use to process the data? I set it to 2 here. And then what I'm going to do is define the training arguments.

So we have this args input, and I'm going to set my TrainingArguments; I already imported TrainingArguments from Hugging Face, if you remember.

Since we have two GPUs, the first argument is per_device_train_batch_size, which I've set to 2. This is the number of examples processed per device (per GPU) at a time.

So that's per_device_train_batch_size. Then gradient_accumulation_steps

equals 4. This is how many steps the gradients accumulate before the weights are updated,

so, to give the intuition, how many batches the training process runs through before we update the weights. Number of training epochs: I'm going to set this to 1, so we do a single pass for this fine-tune run.

Warmup steps equals 5, so we gradually increase the learning rate over the first five steps.

Max steps: how many steps the training process runs at most. I'm going to set this to 60. Learning rate: this is also fairly canonical; I'm going to paste it here.

It's 0.0002, if I'm not mistaken. Again, these are all arguments for which I recommend you check out the Hugging Face documentation to get a better idea.

But again, intuitively speaking, what we're doing is updating the model weights, and we're defining a few of these parameters that let us do this effectively. The other arguments that I'm using: fp16.

This is 16-bit floating point, and the idea here is to use fp16 to speed up training if bfloat16 is not supported. So I'm going to set fp16 to not is_bfloat16_supported(), and then bf16 to is_bfloat16_supported(),

in case we do have bfloat16 support. logging_steps equals 10, so it logs the training progress every ten steps. What optimizer are we going to use? We're going to use the 8-bit AdamW optimizer, adamw_8bit.

We need to add weight_decay=0.01 here. Essentially, this adds regularization to prevent overfitting: it stops the model from overfitting on the dataset and becoming accurate only on the training data while getting worse at other tasks. What type of learning-rate schedule are we going to set? We're going to set it to linear. Again, these are all technicalities of the training and fine-tuning process; my objective is to give you the intuition here, and then you can see it in the code.

I'm going to set the seed to 3407 as well, just so we have a random seed for reproducibility. And the output directory I'm going to set to "outputs". Okay, so I think this is it.

This should be correct. Oh, we have a mistake in the lr_scheduler_type. Cool.

It's lr_scheduler_type. Okay, I think we're good to go here. And now the next step is to train the model.
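Putting the trainer together, the cell looks roughly like this (a sketch based on the arguments we just walked through; exact argument names can shift slightly between TRL versions):

```python
trainer = SFTTrainer(
    model=model_lora,                    # the LoRA-wrapped model
    tokenizer=tokenizer,
    train_dataset=dataset_finetune,      # dataset with the formatted "text" column
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,                  # number of processes for data preprocessing
    args=TrainingArguments(
        per_device_train_batch_size=2,   # examples per GPU per step
        gradient_accumulation_steps=4,   # accumulate gradients over 4 steps before updating
        num_train_epochs=1,
        warmup_steps=5,                  # ramp the learning rate up over the first 5 steps
        max_steps=60,                    # stop after 60 optimizer steps
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,                # log the loss every 10 steps
        optim="adamw_8bit",              # memory-efficient 8-bit AdamW
        weight_decay=0.01,               # regularization against overfitting
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
```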

So we just need to train it, and all you need to do is trainer_stats = trainer.train().

And then what we want to do as well is use the Weights & Biases finish function to close out the fine-tuning run. I'm not going to run it here, because it's going to take 30 to 40 minutes, but I do have the already-run solution that you can see here.
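Those two lines are simply the following (assuming the Weights & Biases run was initialized earlier):

```python
trainer_stats = trainer.train()  # runs the 60 fine-tuning steps, logging loss every 10 steps
wandb.finish()                   # close out the Weights & Biases run
```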

Now, as you can see, we defined those 60 steps that you can see here, so the model took 60 steps: it did 60 iterations of fine-tuning and logged the performance every ten steps, as you can see here. And we see that the training loss went down, which is good to see. And then we ran the Weights & Biases

experiment and finished it. We got this run history and run summary, and it says that if you want to view your project, you can see it right here.

So I go to my fine-tuning run, and as you can see, my loss is going down, which is good to see. And now all I need to do is test out my fine-tuned model. So going back here, I'm not going to run the code, but going through the same exact process, all I'm doing is loading the inference model, which is model_lora, since that's the model we fine-tuned.
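The testing cell mirrors the pre-fine-tuning inference, just pointing at the LoRA-adapted model (again a sketch, with the same assumed "### Response:" split marker):

```python
FastLanguageModel.for_inference(model_lora)   # optimized inference on the fine-tuned model

inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
outputs = model_lora.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
print(tokenizer.batch_decode(outputs)[0].split("### Response:")[1])
```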

As you can see here, we then generate some output tokens. The first question is the same question we asked before fine-tuning, and we see a much shorter chain of thought than before; the format, as you can see, is about three sentences, three lines, much shorter as well than the previous one. Now, I also have another question here that you can see, about a 59-year-old man with fevers, chills, and so on and so forth.

And as you can see, a much shorter chain of thought and a much shorter answer as well. Now, this example is fine-tuning on a medical dataset, but depending on the datasets that you find on Hugging Face, you can really

fine-tune your model on almost anything. And that's it for today's tutorial. Whether DeepSeek ends up being a fad or all smoke and mirrors remains to be seen, but what is undeniable is that you have a pretty strong reasoning model that you can fine-tune today on your own datasets. In a lot of ways, this is still the beginning.

As I said, this space continuously improves, so whatever tool or application you build, make sure you build it in a model-agnostic way, because you don't know what the next best model, or the cheapest model, will be. And with that, I think this is a great place to wrap it up. We left a bunch of resources in the description, with a lot of explanations of some of the technical terms that we weren't able to cover. But with that, have a great day and happy learning.
