AI Home Server 24GB VRAM $750 Budget Build and LLM Benchmarking

Today we're going to be checking out the cheapest way you can get 24 GB of VRAM in a dedicated machine. The machine we're using for this is an HP Z440, and the whole thing should come in under $750, and that's with two RTX 3060 12 GB GPUs, a decent amount of RAM, and a wide variety of processor options. We're also going to look at the electrical usage and the tokens per second you could expect on various models. A setup like this lets you fit larger models and larger context windows, and it's much cheaper than a rig with quad 3090s in it. VRAM is always king: whatever you can reasonably afford, optimize for VRAM, because that's going to have the biggest impact on your experience. At the end we'll also check the price, I'll take you shopping, and we'll look at the spreadsheet and some alternatives you might want to consider. There are some limitations with the Z440, but this is actually not a very complex setup. Follow the guide I produced recently on Proxmox and LXCs to get up and running, and I also have one for Proxmox and Docker if you'd prefer to go that route; both are in the channel history. Make sure to hit like and subscribe while you're down there, and a big shout out and thanks to all of our channel members and subscribers. Let's get started.

The Z440 can support quite a few different processors. The one it comes with might be okay, but some of the lower core count chips with higher single-thread clocks might serve your inference needs better. You also have eight DIMM slots, which is awesome if you're looking for a cheap way to grow a system's DDR4 to a very large level. You don't have to go with the fastest DDR4 either; 2400 MT/s DDR4 would be perfectly fine in this setup.
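As a quick sketch of the ceiling those eight DIMM slots give you (assuming 32 GB modules, which is an assumption; check the memory QVL for your exact CPU):

```shell
# Max RAM with all eight DIMM slots populated with 32 GB modules (assumed size)
SLOTS=8
MODULE_GB=32
echo "max RAM: $((SLOTS * MODULE_GB)) GB"
```

That headroom is what makes the Z440 attractive for stacking VMs alongside the GPUs.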
One of the things workstations are very well known for is a lot of PCIe slots, and the Xeon chip supports plenty of PCIe lanes on the CPU itself. With two full-size slots, it's really easy to put in things like two GPUs. The slot down at the bottom is an old-school PCI slot, not PCIe, so you can just ignore it; it's also a fine slot to cover up with a GPU. I did have to unplug the front USB 3 connector just to get the second card in. Maybe you could make the connector work, or shave it down with an X-Acto knife or something. For a very cheap case and computer, this is a really quick way to get up and running.

We were getting warnings about the front USB not being connected, so what I did here was use a jumper I found on an old motherboard to bridge the header: pins one through three left exposed, with the fourth bridged over.

It should come with a 700-watt-class power supply, and you want the little tag to say 700 W; it says that right here. As long as it has the 700 W supply, you get two separate six-pin PCIe power connectors, and each of them can support 150 W. Your 3060 12 GB is technically a 170 W card. You could power-limit it if you were concerned, but you're very unlikely to spike it up there for long enough periods to cause any sort of blip or shutdown.

For future options, there are 5.25-inch to 3.5-inch or 2.5-inch bay inserts; I'll drop links to some I've used and think are pretty cool, and they would let you fit up to six SSDs in one of those bays. You've also got traditional 3.5-inch drive bays and six SATA ports on this motherboard. I'm going to be using Samsung 870 EVOs, and you can pick these up really cheap now.
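If you do decide to cap the cards to match the six-pin budget, here's a sketch of the math plus the commands you'd run. The wattages are the ones from above, and the `nvidia-smi` lines are only printed here, not executed, since you'd run them as root on the finished box:

```shell
# Power budget sketch for the dual-3060 build (numbers from the video)
PSU_WATTS=700
STOCK_WATTS=170   # stock 3060 12 GB board power
CAP_WATTS=150     # what each six-pin feed is rated for
NUM_GPUS=2

echo "stock GPU draw: $((STOCK_WATTS * NUM_GPUS)) W of a ${PSU_WATTS} W PSU"
echo "capped GPU draw: $((CAP_WATTS * NUM_GPUS)) W"

# Print the commands to apply the cap (run these as root once the driver is up):
i=0
while [ "$i" -lt "$NUM_GPUS" ]; do
  echo "nvidia-smi -i ${i} -pl ${CAP_WATTS}"
  i=$((i + 1))
done
```

Note the power limit resets on reboot unless you reapply it from a startup script or service.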
In the 1 TB range they're around $50, and even the 2 TB range is around $100. You're going to end up with those two six-pin connections, and you'll need six-pin to eight-pin adapters; I've got links for those dropped below.

Okay, now that we've got those two ends connected, I'm going to insert the GPUs. There's a little retention bar that slides down, and that thing can definitely get in the way, so be careful of it. The case also has some sharp edges; I've cut myself on the inside of this case once already. All right, number one is in, then number two, and we'll close it up. Go ahead and plug in the power adapters for your GPUs.

If you've gotten any extra RAM, make sure you refer to the RAM population guide inside your Z440 workstation. It also gives you diagnostics: if you run into beep codes, it tells you what to check. In our instance, with just two DIMMs, the correct configuration is RAM slot one and slot two on the far side; it might look a little weird at first, but that is correct. If you're buying on the used market, eBay is a great place to pick these up for really good prices, and Amazon has them pretty cheap as well; either way, make sure you change out the thermal paste, because what's on the CPU can be pretty old by now. Next we put the fan shroud back in. It has a power connector oriented to go up top, so get the orientation right, and it just slides into the two support arms and snaps down. Seal it up and we're ready to go.

Now let me walk you through the exact settings I've got in my BIOS so you can match them and not run into problems. If you're connected to the internet and have a DHCP address, you can run an update for the BIOS; that might not be a bad thing
to do, but I'm not actually that concerned with it on mine. You can do a system BIOS update if you want, and if it's very out of date you might need to; I'll leave that one up to you.

Over in Advanced is the important stuff. I've got the POST delay set pretty low, and fast boot disabled so I can see what's coming up on the screen. These are settings I found that worked, and I wouldn't be surprised if it came like this, but there is one that's important: enable legacy BIOS boot options, so you can actually boot from the Samsung SSD that's plugged in; you can see I've got a 500 GB one. Backing out of that, under device configuration I don't think you have to do anything other than enable the ports you want, disable the ones you don't, and make sure you're in AHCI mode. Other than that, the rest should be fine left at stock. For the secure boot configuration, enable legacy support and disable secure boot is what I've got set here. For the power options, you can enable the extended runtime settings, which might help it idle down a little better, and runtime power management as well. There's also an idle fan mode you could set if you wanted; it's a little noisy right now while we're in the BIOS, but as soon as we get into the operating system it'll hand over and take control of the fans. As for performance options, we've got hyperthreading enabled. PCIe performance mode was disabled, and I'm going to change that to enabled; it'll take a little more wattage to make that happen, so keep that in mind. MMIO I
have set to Auto, Isoc I have enabled, and I have all cores per processor enabled as well. As for slot settings, you could disable individual slots here if you wanted, but you don't need to; it detects whether something is installed. For graphics configuration I've got it set to the Nvidia card in slot two, and I think that's fine: it'll detect a GPU plugged in. You do have to have a GPU plugged in, though; without one it will just beep at you and you won't be able to do anything. Go through all of that, save any changes, and it'll reboot. Next we're going to plug in our USB drive and install Proxmox.

While I'll go through the setup with you here, you can follow the entire step-by-step written guide over on digitalspaceport.com, with the accompanying video: the local AI software easy setup guide with Ollama and LXC in Proxmox. Following that, we hit the point where we add the GPUs, so let's pick it up from there. Do an `ls /dev/nvidia*` and you can see the device nodes named there: device zero is GPU 0 and device one is GPU 1. We're going to pass these through to our LXC container. Go to the container's Resources, add a device passthrough for `/dev/nvidia0` first, and at the end of adding in all the devices it should look like this. You'll need to stop and restart the container for this to take effect, so log in really quick and run an `apt update` just to see if there's anything you need to update.
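For reference, the GUI passthrough entries above just end up as plain lines in the container's config file, something like this (container ID 100 is an assumption; check your own device list with `ls /dev/nvidia*`, and note the `dev` entries need a reasonably recent Proxmox):

```
# /etc/pve/lxc/100.conf (fragment); container ID 100 is an assumption
dev0: /dev/nvidia0
dev1: /dev/nvidia1
dev2: /dev/nvidiactl
dev3: /dev/nvidia-uvm
dev4: /dev/nvidia-uvm-tools
```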
`shutdown now`. It'll power down, and when we power it back on, the Nvidia GPUs will be passed through; go ahead and hit start. Once the container is back up with those devices passed through as resources, you can see them change color in the resource list as soon as they come through. Over in the Proxmox shell, do a quick `ls`, copy the full path of the Nvidia driver installer, and run `pct push 100 <installer path> /root/<installer name>`. This pushes the driver into the container, so the container does need to be running for it to work. Back in the container, log in, do a quick `ls`, then `chmod +x NVIDIA*` and run `./NVIDIA*.run --no-kernel-modules`. The no-kernel-modules flag is because an LXC doesn't use its own kernel; it uses the Proxmox host's kernel. Make sure you select the proprietary driver, as that's what we've already installed on the host, and it'll go ahead and install. It'll tell you about restarting X, but you don't have a desktop environment installed, so that's not needed. Now when you run nvtop, which you might need to `apt install nvtop`, you'll see your two 3060s with 12 GB of VRAM each installed. Super freaking awesome. Now we're ready to test out the performance on some of the models we've got.

We're going to start off with just a greeting, saying hi to the LLM as an initial warm-up. That's not the response tokens per second and prompt tokens per second I'm going to record; it's just to get the model loaded and make sure everything's up and running. Still, it's maybe interesting to see: on this one, about 14, and about 11 to 12, for QwQ 32B. If you look over on the side here I'm running htop, and with htop you can arrow over to the right and see the parameters it's loading with, so
you can see we have 4096 loaded in as the ctx size and parallel is two; that's what it's actually running with under the hood. Over here you can see we've got roughly 84 to 85% of the GPU memory taken up. So we'll open a new chat window and quickly toss it Armageddon with a twist. I'm not really checking whether it gets this right; we're just benchmarking how many tokens per second it can produce, so whatever it says doesn't matter. That said, QwQ is one of the better models, and a 32B like this is one of the primary reasons you want a large amount of VRAM, because right now it's running entirely in VRAM. It reports 100% in the GPUs at a detected size of roughly 24 GB, which is pretty close to that 84 to 85% estimate, and you can see it creep up a little as it continues to churn through the context window. Going straight to the numbers: 11.92 response tokens per second and 1260 prompt tokens per second, so not bad. Let's toss that into the spreadsheet. What I'm capturing here is the ctx size, which was 4096 on this run, and num parallel, which is something you set as a special environment variable. Next we're moving on to Gemma 3 27B at Q4, and we're going to run this one at a ctx of 2048; I don't know whether it's going to fit or not, so this is a good question. Since this is a 27-billion-parameter model, which is quite large, the quant is set to four and the ctx to 2048. I've been having some trouble with this one and drilling down into why.
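Those two knobs are set outside the chat UI: `num_ctx` can go per request or per model, while the parallelism is the `OLLAMA_NUM_PARALLEL` environment variable on the Ollama server. On a standard systemd install, an override sketch might look like this (the context-length variable name is my assumption for current builds; check the docs for your version):

```
# /etc/systemd/system/ollama.service.d/override.conf (fragment)
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_CONTEXT_LENGTH=4096"
```

After editing, a `systemctl daemon-reload` and restart of the service picks the values up.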
It's seemingly only using about 75% of the VRAM, yet it's still splitting. That's a bit of a bummer, because splitting definitely slows things down quite a bit, so there have got to be some settings somewhere to adjust that would make this a lot better, is my hope. The GPU utilization also looks surprisingly low, at about 15 to 16%, but you start seeing that happen once you get into split mode, and of course the tokens per second are much lower: about 6 response and 403 prompt. Drop your ideas in the comments below; it could be a weird bug, and I am on the latest of everything here, Ollama 0.6.5, which is quite new, so it shouldn't be that I'm out of date.

Next let's give Cogito a run. This is a 14 billion at Q8, and Q8 is kind of my favorite: you get a lot more resolution per weight, so the evaluation at each layer has many more options to consider. A Q8 weight has 256 possible values, a Q4 weight only 16, and full precision is a vastly larger space again, so a lot more nuance is available to the model. Let's give it a quick warm-up and see if we can get it to reside completely in GPU memory so we don't hit a slowdown; that's the goal here. It looks like we're just under 10 gigabytes there, so that looks promising: 17.52 tokens per second on that warm-up response.
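To put numbers on that quantization point: a weight stored in n bits can take 2^n distinct values (a simplification; real quant formats also store per-block scales):

```shell
# Distinct representable values per weight at each bit width
for bits in 4 8 16; do
  echo "${bits}-bit: $((1 << bits)) levels"
done
```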

And 16 on the prompt side. But like I said, what we're interested in is running Armageddon with a twist to see how it degrades, so let's do that. This prompt is a bit bigger, and it can oftentimes take an LLM more time to process, so more tokens get burned, which also demonstrates the capacity you'll be eating into as extra VRAM, because usage can grow a little over time. It looks like we got 16.68 response tokens per second and 1558 prompt tokens per second.

Next we're moving up to another Gemma 3, but this one is different: it's a Q8 Gemma 3, which means we're going down in parameter size to let it fit. All of these runs are at a ctx of 4096, so I need to correct that: all except Gemma 3 27B, the Q4 that wasn't able to fit effectively even though there appeared to be enough room. Checking this one, we hit 19.79 and 16 on the warm-up, so let's go ahead and see how it handles Armageddon with a twist. It looks like it's keeping pace pretty well, and you can see the memory grow there a little like I was mentioning; eventually it mostly stops growing. That growth is why, once you run out of context, you generally need to start a new window to continue the conversation with most of the LLMs out there. And that's actually really good: 19.67 response tokens per second, 1,012 prompt tokens per second.

Now we're going to download Mistral Small 3.1, which is just out, and also DeepCoder preview; those are a 24B and a 14B model. All right, let's load up Mistral. Oh darn, that is not very close, and we definitely saw a pretty big impact on the response tokens per second as a result: just the warm-up was at 2.5 response tokens per second on Mistral.
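The context-driven VRAM growth mentioned above can be ballparked: the KV cache stores a key vector and a value vector per layer per token. The shapes below (40 layers, 8 KV heads, head dimension 128, fp16) are illustrative assumptions, since every model's real shapes differ:

```shell
# Rough KV-cache growth: 2 tensors (K and V) * layers * kv_heads * head_dim * bytes
LAYERS=40
KV_HEADS=8
HEAD_DIM=128
BYTES=2   # fp16

per_token=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
echo "KV cache per token: $((per_token / 1024)) KiB"
echo "at 4096 tokens: $((per_token * 4096 / 1024 / 1024)) MiB"
```

That extra fraction of a gigabyte per few thousand tokens is what you watch creep up in nvtop as a chat runs on.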

at 16 and 84, and yeah, something is amiss here; we've definitely got a bug in Mistral or something going on. You'll see a lot of that if you're trying to catch the latest, greatest, cutting-edge stuff out there. Seeing that one of the GPUs has loaded up and the other isn't loaded at all definitely indicates a problem we did not have with the other models, and that makes any results we'd get from Mistral kind of moot, so the best thing I can do is move on to DeepCoder and not let it hold us up. You want things loaded into your GPUs with 100% GPU utilization; that's what gives you responses that are nice and snappy, like the 17.8 and 7.8 we just saw come back on DeepCoder 14B Q8. Looking over here, we're at about 71 to 72% of the GPU memory utilized, which is actually really cool, because it would let you go quite a bit larger with your context window, and for code work especially you definitely want to do that. At any rate, we're just testing the performance in a consistent fashion so people can see what to expect. Since DeepCoder is a thinking tune, we can maybe get some decent answers out of it, especially for code tasks. I'm not sure how it'll do as a general chat agent, being built as a coder model; I'd expect it's pretty much optimized for coding tasks. But it does look like it's performing at a decent tokens per second and keeping pace: we got back 17 response tokens per second, with 1,654 prompt tokens per second, and that looks like right around 71% utilization; for Cogito we saw 55%.

And Gemma 3 12B Q8, what did we see when that one was loaded up? Let's go back and give that one one
more chance here. I think we got right around 19.6 response tokens per second and about 340 prompt tokens per second, and the VRAM we're observing I'd call about 60% total utilized. So that gives us a pretty good idea of some of the best models you'd be able to run fully in VRAM, and these are all very new; they've all come out really recently. Once you start doing this there's so much out there, and Ollama really is your gateway drug: super simple, makes things easy. They're deploying a new engine themselves, so a lot of stuff is getting fixed and getting better under the hood, but that new engine runner they're developing has also caused some hiccups recently, so things have been a little slower to get out the door; I think we'll see that pick back up pretty shortly. DeepCoder, Mistral, Cogito, these are all incredibly recent, and Gemma 3 really is the strongest one out there in this class and size. Even looking at the analysis here, we got a pretty good amount of VRAM utilization; of course you ideally want to push it further than this, into the 90s if you can, while also giving yourself that extra context window length. So those are some of the models you could run. I did make a switch, and there is another surprise video coming out, so make sure you hit like and subscribe, because this Z440 is getting two videos: there's something interesting I found, and I think you're going to find it pretty interesting also.

Now let's take a quick look at the $750 AI system. This really is under $750 if you go with just the things that came with it: you don't have to buy any RAM if it comes with RAM, and most of the systems out there come with
some sort of RAM, so that's potentially $30 taken off. $250 does look like the price point for a 3060 12 GB on eBay right now; it does not look like that price point on Amazon, so do keep in mind that's $500 when you're looking at two of them. I put a 500 GB SSD in the sheet, and that is very small. If you're seriously wanting to tinker around and have fun, get at least a 1 TB SSD; it'll make things a lot more fun, and you can download a lot more, especially once you've seen the Proxmox helper scripts site linked in the description below, an excellent place where you'll absolutely find things that make it really quick to get up and running. The total of this system is actually, maybe surprisingly, under $750. Of course that could change; inflation, who knows. The base Z440s with a CPU and usually 16 GB of RAM are somewhere around $100, and that usually does not include any sort of SSD, so set aside $30 to $50 if you're going to get a 1 TB SSD, an excellent way to spend the extra money, and you should be looking at around $650 all in. There will be taxes, though most of the $100 listings ship free. At this price point, that brings our price per gigabyte of VRAM to $26.25, which is incredibly low compared to so many other systems out there. So this is really pretty cool, and we're going to see some even cooler capabilities from the same system; you're going to get a back-to-back video on this Z440.

So there you have it: a pretty awesome little system that can expand out to do quite a lot, even supporting up to 22-core processors. We're going to see some pretty cool stuff in just a couple more hours.
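For reference, the spreadsheet math works out like this (the line items are the eBay ballparks quoted above, so treat them as snapshots, not guarantees):

```shell
# Build-sheet totals from the prices discussed above
Z440=100          # base unit with CPU and 16 GB RAM
GPUS=$((2 * 250)) # two 3060 12 GB cards
SSD=30            # 500 GB SATA SSD (roughly 50 for 1 TB)

TOTAL=$((Z440 + GPUS + SSD))
echo "total: \$${TOTAL}"
# 24 GB of VRAM across the two cards:
awk -v t="$TOTAL" 'BEGIN { printf "per GB of VRAM: $%.2f\n", t / 24 }'
```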
I have a feeling maybe tomorrow, but very soon after this video is out, a second Z440 video is going to be dropping; I've got some really interesting things I have to test out a little more, but I'm very excited to show you some of that. So there you have it: the HP Z440, around or under $750, definitely under if you have any extra parts, or if you're just buying the Z440 itself and the 3060 12 GB cards. Certainly a lot of people would think to themselves, hey, how else could I get 24 GB? There are other ways, but this is one of the cheapest ways to get there, and for something that idles at about 65 watts and peaks at about 190 watts, that's not too bad. I definitely know from other experience that this is just a really good home lab unit. These machines are well known for their reliability, and having eight DIMM slots is pretty crazy; you don't get that in many things. When you really load it up you've got a lot of RAM and can run an almost unlimited number of VMs; usually RAM is the thing that gives out first. The wide variety of CPUs, with even the Xeon E5-2600 series being functional in here, is pretty crazy too. I hope you've had a good time checking out this video with me, and I think there's some cool stuff you're going to see tomorrow. I'll have links to all of this, the build guide and the statistics I'm putting in the website article, in the description below. Drop a comment, toss a thumbs up if you like this kind of content, hit subscribe if you're interested, and you can even ring the bell so you get notified when I drop content like this. Do make sure you check out the channel history, sort it by newest, and look for a more recent Proxmox LXC or maybe even Docker guide if you want to go that
route. I will be working on a new Docker guide, and I'm going to try to make the LXC release and the Docker release happen at about the same time, because Docker has announced some of their initiatives in just the past couple of days and is making a lot of really new advancements, so I don't want to drop support for it. I think there's room on the same machine for both of these to run, because Docker runs inside an LXC perfectly fine, and you get the capability to share your most valuable resource, the GPUs, which of course saves you a lot of money. You can share those not just with other AI applications but with all sorts of other virtual machines and services. Like I mentioned, the Proxmox helper scripts website in the description below is a wonderful place to begin your home labbing experience. Make sure to let me know what projects you're working on, and like I've hinted a couple of times, make sure you have rung the bell if you're interested in this machine in particular, and also in general ridiculousness. Everybody have a great rest of your day, and I will check you out next time.
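If you do go the Docker-inside-the-LXC route, a compose sketch for GPU access might look like the following. It assumes the NVIDIA Container Toolkit is installed inside the container, and the service layout is purely illustrative:

```
# docker-compose.yml (sketch)
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```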

2025-04-14 17:09

