AI Home Server 24GB VRAM $750 Budget Build and LLM Benchmarking

Today we're going to be checking out the cheapest way you can get 24 GB of VRAM in a dedicated machine. The machine we're using for this is an HP Z440, and the whole thing should come in under $750, and that's with two RTX 3060 12 GB GPUs, a decent amount of RAM, and a wide variety of processor options. We're also going to look at the electrical usage and the tokens per second you could expect on various models. A setup like this lets you fit larger models and larger context windows, and it's much cheaper than a rig with quad 3090s in it. VRAM is always king: whatever you can reasonably afford, optimize for VRAM, because that's going to have the biggest impact on your experience. At the end we'll also check the price, I'll take you shopping, and we'll look at the spreadsheet and some alternatives you might want to consider. There are some limitations with the Z440, but this is actually not a very complex setup. Follow the guide I produced recently on Proxmox and LXCs to get up and running, and I also have one for Proxmox and Docker if you'd prefer to go that route; both are in the channel history. Make sure to hit like and subscribe while you're down there, and a big shout out and thanks to all of our channel members and subscribers. Let's get started.

The Z440 can support quite a few different processors. The one it comes with might be okay, but some of the lower core count chips with higher single-thread clocks might serve your inference needs better. You also have eight DIMM slots, which is awesome if you're looking for a cheap way to grow a system's DDR4 to a very large level. You don't have to go with the fastest DDR4 either; 2400 MT/s DDR4 would be perfectly fine in this setup.
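As a quick sketch of the ceiling those eight DIMM slots give you (assuming 32 GB modules, which is an assumption; check the memory QVL for your exact CPU):

```shell
# Max RAM with all eight DIMM slots populated with 32 GB modules (assumed size)
SLOTS=8
MODULE_GB=32
echo "max RAM: $((SLOTS * MODULE_GB)) GB"
```

That headroom is what makes the Z440 attractive for stacking VMs alongside the GPUs.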
One of the things workstations are very well known for is a lot of PCIe slots, and the Xeon chip supports plenty of PCIe lanes on the CPU itself. With two full-size slots, it's really easy to put in things like two GPUs. The slot down at the bottom is an old-school PCI slot, not PCIe, so you can just ignore it; it's also a fine slot to cover up with a GPU. I did have to unplug the front USB 3 connector just to get the second card in. Maybe you could make the connector work, or shave it down with an X-Acto knife or something. For a very cheap case and computer, this is a really quick way to get up and running.

We were getting warnings about the front USB not being connected, so what I did here was use a jumper I found on an old motherboard to bridge the header: pins one through three left exposed, with the fourth bridged over.

It should come with a 700-watt-class power supply, and you want the little tag to say 700 W; it says that right here. As long as it has the 700 W supply, you get two separate six-pin PCIe power connectors, and each of them can support 150 W. Your 3060 12 GB is technically a 170 W card. You could power-limit it if you were concerned, but you're very unlikely to spike it up there for long enough periods to cause any sort of blip or shutdown.

For future options, there are 5.25-inch to 3.5-inch or 2.5-inch bay inserts; I'll drop links to some I've used and think are pretty cool, and they would let you fit up to six SSDs in one of those bays. You've also got traditional 3.5-inch drive bays and six SATA ports on this motherboard. I'm going to be using Samsung 870 EVOs, and you can pick these up really cheap now.
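If you do decide to cap the cards to match the six-pin budget, here's a sketch of the math plus the commands you'd run. The wattages are the ones from above, and the `nvidia-smi` lines are only printed here, not executed, since you'd run them as root on the finished box:

```shell
# Power budget sketch for the dual-3060 build (numbers from the video)
PSU_WATTS=700
STOCK_WATTS=170   # stock 3060 12 GB board power
CAP_WATTS=150     # what each six-pin feed is rated for
NUM_GPUS=2

echo "stock GPU draw: $((STOCK_WATTS * NUM_GPUS)) W of a ${PSU_WATTS} W PSU"
echo "capped GPU draw: $((CAP_WATTS * NUM_GPUS)) W"

# Print the commands to apply the cap (run these as root once the driver is up):
i=0
while [ "$i" -lt "$NUM_GPUS" ]; do
  echo "nvidia-smi -i ${i} -pl ${CAP_WATTS}"
  i=$((i + 1))
done
```

Note the power limit resets on reboot unless you reapply it from a startup script or service.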
In the 1 TB range they're around $50, and even the 2 TB range is around $100. You're going to end up with those two six-pin connections, and you'll need six-pin to eight-pin adapters; I've got links for those dropped below.

Okay, now that we've got those two ends connected, I'm going to insert the GPUs. There's a little retention bar that slides down, and that thing can definitely get in the way, so be careful of it. The case also has some sharp edges; I've cut myself on the inside of this case once already. All right, number one is in, then number two, and we'll close it up. Go ahead and plug in the power adapters for your GPUs.

If you've gotten any extra RAM, make sure you refer to the RAM population guide inside your Z440 workstation. It also gives you diagnostics: if you run into beep codes, it tells you what to check. In our instance, with just two DIMMs, the correct configuration is RAM slot one and slot two on the far side; it might look a little weird at first, but that is correct. If you're buying on the used market, eBay is a great place to pick these up for really good prices, and Amazon has them pretty cheap as well; either way, make sure you change out the thermal paste, because what's on the CPU can be pretty old by now. Next we put the fan shroud back in. It has a power connector oriented to go up top, so get the orientation right, and it just slides into the two support arms and snaps down. Seal it up and we're ready to go.

Now let me walk you through the exact settings I've got in my BIOS so you can match them and not run into problems. If you're connected to the internet and have a DHCP address, you can run an update for the BIOS; that might not be a bad thing
to do, but I'm not actually that concerned with it on mine. You can do a system BIOS update if you want, and if it's very out of date you might need to; I'll leave that one up to you.

Over in Advanced is the important stuff. I've got the POST delay set pretty low, and fast boot disabled so I can see what's coming up on the screen. These are settings I found that worked, and I wouldn't be surprised if it came like this, but there is one that's important: enable legacy BIOS boot options, so you can actually boot from the Samsung SSD that's plugged in; you can see I've got a 500 GB one. Backing out of that, under device configuration I don't think you have to do anything other than enable the ports you want, disable the ones you don't, and make sure you're in AHCI mode. Other than that, the rest should be fine left at stock. For the secure boot configuration, enable legacy support and disable secure boot is what I've got set here. For the power options, you can enable the extended runtime settings, which might help it idle down a little better, and runtime power management as well. There's also an idle fan mode you could set if you wanted; it's a little noisy right now while we're in the BIOS, but as soon as we get into the operating system it'll hand over and take control of the fans. As for performance options, we've got hyperthreading enabled. PCIe performance mode was disabled, and I'm going to change that to enabled; it'll take a little more wattage to make that happen, so keep that in mind. MMIO I
have set to Auto, Isoc I have enabled, and I have all cores per processor enabled as well. As for slot settings, you could disable individual slots here if you wanted, but you don't need to; it detects whether something is installed. For graphics configuration I've got it set to the Nvidia card in slot two, and I think that's fine: it'll detect a GPU plugged in. You do have to have a GPU plugged in, though; without one it will just beep at you and you won't be able to do anything. Go through all of that, save any changes, and it'll reboot. Next we're going to plug in our USB drive and install Proxmox.

While I'll go through the setup with you here, you can follow the entire step-by-step written guide over on digitalspaceport.com, with the accompanying video: the local AI software easy setup guide with Ollama and LXC in Proxmox. Following that, we hit the point where we add the GPUs, so let's pick it up from there. Do an `ls /dev/nvidia*` and you can see the device nodes named there: device zero is GPU 0 and device one is GPU 1. We're going to pass these through to our LXC container. Go to the container's Resources, add a device passthrough for `/dev/nvidia0` first, and at the end of adding in all the devices it should look like this. You'll need to stop and restart the container for this to take effect, so log in really quick and run an `apt update` just to see if there's anything you need to update.
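For reference, the GUI passthrough entries above just end up as plain lines in the container's config file, something like this (container ID 100 is an assumption; check your own device list with `ls /dev/nvidia*`, and note the `dev` entries need a reasonably recent Proxmox):

```
# /etc/pve/lxc/100.conf (fragment); container ID 100 is an assumption
dev0: /dev/nvidia0
dev1: /dev/nvidia1
dev2: /dev/nvidiactl
dev3: /dev/nvidia-uvm
dev4: /dev/nvidia-uvm-tools
```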
`shutdown now`. It'll power down, and when we power it back on, the Nvidia GPUs will be passed through; go ahead and hit start. Once the container is back up with those devices passed through as resources, you can see them change color in the resource list as soon as they come through. Over in the Proxmox shell, do a quick `ls`, copy the full path of the Nvidia driver installer, and run `pct push 100 <installer path> /root/<installer name>`. This pushes the driver into the container, so the container does need to be running for it to work. Back in the container, log in, do a quick `ls`, then `chmod +x NVIDIA*` and run `./NVIDIA*.run --no-kernel-modules`. The no-kernel-modules flag is because an LXC doesn't use its own kernel; it uses the Proxmox host's kernel. Make sure you select the proprietary driver, as that's what we've already installed on the host, and it'll go ahead and install. It'll tell you about restarting X, but you don't have a desktop environment installed, so that's not needed. Now when you run nvtop, which you might need to `apt install nvtop`, you'll see your two 3060s with 12 GB of VRAM each installed. Super freaking awesome. Now we're ready to test out the performance on some of the models we've got.

We're going to start off with just a greeting, saying hi to the LLM as an initial warm-up. That's not the response tokens per second and prompt tokens per second I'm going to record; it's just to get the model loaded and make sure everything's up and running. Still, it's maybe interesting to see: on this one, about 14, and about 11 to 12, for QwQ 32B. If you look over on the side here I'm running htop, and with htop you can arrow over to the right and see the parameters it's loading with, so
you can see we have 4096 loaded in as the ctx size and parallel is two; that's what it's actually running with under the hood. Over here you can see we've got roughly 84 to 85% of the GPU memory taken up. So we'll open a new chat window and quickly toss it Armageddon with a twist. I'm not really checking whether it gets this right; we're just benchmarking how many tokens per second it can produce, so whatever it says doesn't matter. That said, QwQ is one of the better models, and a 32B like this is one of the primary reasons you want a large amount of VRAM, because right now it's running entirely in VRAM. It reports 100% in the GPUs at a detected size of roughly 24 GB, which is pretty close to that 84 to 85% estimate, and you can see it creep up a little as it continues to churn through the context window. Going straight to the numbers: 11.92 response tokens per second and 1260 prompt tokens per second, so not bad. Let's toss that into the spreadsheet. What I'm capturing here is the ctx size, which was 4096 on this run, and num parallel, which is something you set as a special environment variable. Next we're moving on to Gemma 3 27B at Q4, and we're going to run this one at a ctx of 2048; I don't know whether it's going to fit or not, so this is a good question. Since this is a 27-billion-parameter model, which is quite large, the quant is set to four and the ctx to 2048. I've been having some trouble with this one and drilling down into why.
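Those two knobs are set outside the chat UI: `num_ctx` can go per request or per model, while the parallelism is the `OLLAMA_NUM_PARALLEL` environment variable on the Ollama server. On a standard systemd install, an override sketch might look like this (the context-length variable name is my assumption for current builds; check the docs for your version):

```
# /etc/systemd/system/ollama.service.d/override.conf (fragment)
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_CONTEXT_LENGTH=4096"
```

After editing, a `systemctl daemon-reload` and restart of the service picks the values up.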
It's seemingly only using about 75% of the VRAM, yet it's still splitting. That's a bit of a bummer, because splitting definitely slows things down quite a bit, so there have got to be some settings somewhere to adjust that would make this a lot better, is my hope. The GPU utilization also looks surprisingly low, at about 15 to 16%, but you start seeing that happen once you get into split mode, and of course the tokens per second are much lower: about 6 response and 403 prompt. Drop your ideas in the comments below; it could be a weird bug, and I am on the latest of everything here, Ollama 0.6.5, which is quite new, so it shouldn't be that I'm out of date.

Next let's give Cogito a run. This is a 14 billion at Q8, and Q8 is kind of my favorite: you get a lot more resolution per weight, so the evaluation at each layer has many more options to consider. A Q8 weight has 256 possible values, a Q4 weight only 16, and full precision is a vastly larger space again, so a lot more nuance is available to the model. Let's give it a quick warm-up and see if we can get it to reside completely in GPU memory so we don't hit a slowdown; that's the goal here. It looks like we're just under 10 gigabytes there, so that looks promising: 17.52 tokens per second on that warm-up response.
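To put numbers on that quantization point: a weight stored in n bits can take 2^n distinct values (a simplification; real quant formats also store per-block scales):

```shell
# Distinct representable values per weight at each bit width
for bits in 4 8 16; do
  echo "${bits}-bit: $((1 << bits)) levels"
done
```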

And 16 on the prompt side. But like I said, what we're interested in is running Armageddon with a twist to see how it degrades, so let's do that. This prompt is a bit bigger, and it can oftentimes take an LLM more time to process, so more tokens get burned, which also demonstrates the capacity you'll be eating into as extra VRAM, because usage can grow a little over time. It looks like we got 16.68 response tokens per second and 1558 prompt tokens per second.

Next we're moving up to another Gemma 3, but this one is different: it's a Q8 Gemma 3, which means we're going down in parameter size to let it fit. All of these runs are at a ctx of 4096, so I need to correct that: all except Gemma 3 27B, the Q4 that wasn't able to fit effectively even though there appeared to be enough room. Checking this one, we hit 19.79 and 16 on the warm-up, so let's go ahead and see how it handles Armageddon with a twist. It looks like it's keeping pace pretty well, and you can see the memory grow there a little like I was mentioning; eventually it mostly stops growing. That growth is why, once you run out of context, you generally need to start a new window to continue the conversation with most of the LLMs out there. And that's actually really good: 19.67 response tokens per second, 1,012 prompt tokens per second.

Now we're going to download Mistral Small 3.1, which is just out, and also DeepCoder preview; those are a 24B and a 14B model. All right, let's load up Mistral. Oh darn, that is not very close, and we definitely saw a pretty big impact on the response tokens per second as a result: just the warm-up was at 2.5 response tokens per second on Mistral.
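The context-driven VRAM growth mentioned above can be ballparked: the KV cache stores a key vector and a value vector per layer per token. The shapes below (40 layers, 8 KV heads, head dimension 128, fp16) are illustrative assumptions, since every model's real shapes differ:

```shell
# Rough KV-cache growth: 2 tensors (K and V) * layers * kv_heads * head_dim * bytes
LAYERS=40
KV_HEADS=8
HEAD_DIM=128
BYTES=2   # fp16

per_token=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
echo "KV cache per token: $((per_token / 1024)) KiB"
echo "at 4096 tokens: $((per_token * 4096 / 1024 / 1024)) MiB"
```

That extra fraction of a gigabyte per few thousand tokens is what you watch creep up in nvtop as a chat runs on.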

at 16 and 84, and yeah, something is amiss here; we've definitely got a bug in Mistral or something going on. You'll see a lot of that if you're trying to catch the latest, greatest, cutting-edge stuff out there. Seeing that one of the GPUs has loaded up and the other isn't loaded at all definitely indicates a problem we did not have with the other models, and that makes any results we'd get from Mistral kind of moot, so the best thing I can do is move on to DeepCoder and not let it hold us up. You want things loaded into your GPUs with 100% GPU utilization; that's what gives you responses that are nice and snappy, like the 17.8 and 7.8 we just saw come back on DeepCoder 14B Q8. Looking over here, we're at about 71 to 72% of the GPU memory utilized, which is actually really cool, because it would let you go quite a bit larger with your context window, and for code work especially you definitely want to do that. At any rate, we're just testing the performance in a consistent fashion so people can see what to expect. Since DeepCoder is a thinking tune, we can maybe get some decent answers out of it, especially for code tasks. I'm not sure how it'll do as a general chat agent, being built as a coder model; I'd expect it's pretty much optimized for coding tasks. But it does look like it's performing at a decent tokens per second and keeping pace: we got back 17 response tokens per second, with 1,654 prompt tokens per second, and that looks like right around 71% utilization; for Cogito we saw 55%.

And Gemma 3 12B Q8, what did we see when that one was loaded up? Let's go back and give that one one
more chance here. I think we got right around 19.6 response tokens per second and about 340 prompt tokens per second, and the VRAM we're observing I'd call about 60% total utilized. So that gives us a pretty good idea of some of the best models you'd be able to run fully in VRAM, and these are all very new; they've all come out really recently. Once you start doing this there's so much out there, and Ollama really is your gateway drug: super simple, makes things easy. They're deploying a new engine themselves, so a lot of stuff is getting fixed and getting better under the hood, but that new engine runner they're developing has also caused some hiccups recently, so things have been a little slower to get out the door; I think we'll see that pick back up pretty shortly. DeepCoder, Mistral, Cogito, these are all incredibly recent, and Gemma 3 really is the strongest one out there in this class and size. Even looking at the analysis here, we got a pretty good amount of VRAM utilization; of course you ideally want to push it further than this, into the 90s if you can, while also giving yourself that extra context window length. So those are some of the models you could run. I did make a switch, and there is another surprise video coming out, so make sure you hit like and subscribe, because this Z440 is getting two videos: there's something interesting I found, and I think you're going to find it pretty interesting also.

Now let's take a quick look at the $750 AI system. This really is under $750 if you go with just the things that came with it: you don't have to buy any RAM if it comes with RAM, and most of the systems out there come with
some sort of RAM, so that's potentially $30 taken off. $250 does look like the price point for a 3060 12 GB on eBay right now; it does not look like that price point on Amazon, so do keep in mind that's $500 when you're looking at two of them. I put a 500 GB SSD in the sheet, and that is very small. If you're seriously wanting to tinker around and have fun, get at least a 1 TB SSD; it'll make things a lot more fun, and you can download a lot more, especially once you've seen the Proxmox helper scripts site linked in the description below, an excellent place where you'll absolutely find things that make it really quick to get up and running. The total of this system is actually, maybe surprisingly, under $750. Of course that could change; inflation, who knows. The base Z440s with a CPU and usually 16 GB of RAM are somewhere around $100, and that usually does not include any sort of SSD, so set aside $30 to $50 if you're going to get a 1 TB SSD, an excellent way to spend the extra money, and you should be looking at around $650 all in. There will be taxes, though most of the $100 listings ship free. At this price point, that brings our price per gigabyte of VRAM to $26.25, which is incredibly low compared to so many other systems out there. So this is really pretty cool, and we're going to see some even cooler capabilities from the same system; you're going to get a back-to-back video on this Z440.

So there you have it: a pretty awesome little system that can expand out to do quite a lot, even supporting up to 22-core processors. We're going to see some pretty cool stuff in just a couple more hours.
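For reference, the spreadsheet math works out like this (the line items are the eBay ballparks quoted above, so treat them as snapshots, not guarantees):

```shell
# Build-sheet totals from the prices discussed above
Z440=100          # base unit with CPU and 16 GB RAM
GPUS=$((2 * 250)) # two 3060 12 GB cards
SSD=30            # 500 GB SATA SSD (roughly 50 for 1 TB)

TOTAL=$((Z440 + GPUS + SSD))
echo "total: \$${TOTAL}"
# 24 GB of VRAM across the two cards:
awk -v t="$TOTAL" 'BEGIN { printf "per GB of VRAM: $%.2f\n", t / 24 }'
```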
I have a feeling maybe tomorrow, but very soon after this video is out, a second Z440 video is going to be dropping; I've got some really interesting things I have to test out a little more, but I'm very excited to show you some of that. So there you have it: the HP Z440, around or under $750, definitely under if you have any extra parts, or if you're just buying the Z440 itself and the 3060 12 GB cards. Certainly a lot of people would think to themselves, hey, how else could I get 24 GB? There are other ways, but this is one of the cheapest ways to get there, and for something that idles at about 65 watts and peaks at about 190 watts, that's not too bad. I definitely know from other experience that this is just a really good home lab unit. These machines are well known for their reliability, and having eight DIMM slots is pretty crazy; you don't get that in many things. When you really load it up you've got a lot of RAM and can run an almost unlimited number of VMs; usually RAM is the thing that gives out first. The wide variety of CPUs, with even the Xeon E5-2600 series being functional in here, is pretty crazy too. I hope you've had a good time checking out this video with me, and I think there's some cool stuff you're going to see tomorrow. I'll have links to all of this, the build guide and the statistics I'm putting in the website article, in the description below. Drop a comment, toss a thumbs up if you like this kind of content, hit subscribe if you're interested, and you can even ring the bell so you get notified when I drop content like this. Do make sure you check out the channel history, sort it by newest, and look for a more recent Proxmox LXC or maybe even Docker guide if you want to go that
route. I will be working on a new Docker guide, and I'm going to try to make the LXC release and the Docker release happen at about the same time, because Docker has announced some of their initiatives in just the past couple of days and is making a lot of really new advancements, so I don't want to drop support for it. I think there's room on the same machine for both of these to run, because Docker runs inside an LXC perfectly fine, and you get the capability to share your most valuable resource, the GPUs, which of course saves you a lot of money. You can share those not just with other AI applications but with all sorts of other virtual machines and services. Like I mentioned, the Proxmox helper scripts website in the description below is a wonderful place to begin your home labbing experience. Make sure to let me know what projects you're working on, and like I've hinted a couple of times, make sure you have rung the bell if you're interested in this machine in particular, and also in general ridiculousness. Everybody have a great rest of your day, and I will check you out next time.
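If you do go the Docker-inside-the-LXC route, a compose sketch for GPU access might look like the following. It assumes the NVIDIA Container Toolkit is installed inside the container, and the service layout is purely illustrative:

```
# docker-compose.yml (sketch)
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```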

2025-04-14 17:09

