Running the full DeepSeek 671B on a $500 machine is possible, and I'm actually a little surprised at the tokens per second we're able to get with it. This is on a $5 processor. That's right, you heard me: a $5 processor is what enables this, along with 512 GB of RAM and a very cheap HP Z440. All of that is about $500 total. I'm sure you've got a lot of questions, so let's put this system together and start answering them. You can use the chapters below at any time to skip around this video, including to the conclusion if you want to get straight to the takeaways (it's going to be a pretty long conclusion, too). All of these stats are recorded at digitalspaceport.com, and the link is in the description below.

The build is really, really simple. All we're going to do is pop in the RAM, 64 GB LRDIMMs, which are undocumented but functional in an HP Z440 (though not necessarily functional the way you'd hope), and make sure you have good, tight contact on your CPU, whichever one you go with. We're also going to check out the 2696 v4. The shroud does keep those RAM sticks cool and prevents extra warning messages, but you will still get this one warning; just hit Enter to skip it. I cannot find a way to get rid of the 539 warning screen on the HP Z440 entirely, but the machine boots up fine, works fine, and all of the RAM is present.

This is where I have our DeepSeek model stored, and it is quite a large file, about 400-some gigabytes, so keep that in mind if you're considering this. You can see we've got our E5-2650 v4 processor and, of course, the RAM, and on the right-hand side htop is running, with btop on the bottom. Kicking it off with just a "howdy" to get an idea of the tokens per second: we're getting 2 tokens per second on the response, 2.9 tokens per second on the prompt, 22 total tokens. Next we'll run a much larger prompt and see what we get back as far as the
inference speed. This will be much slower, so I am definitely going to do some cutting here. If you want to check out all of these runs, you can go to digitalspaceport.com and look for the $500 CPU inference article that I'm going to put together. You can see that came down to 0.75, so a lot of degradation: 3.26 prompt tokens per second and 279 total tokens. That was not fast at all. I also wanted to check this out on the 2696 v4, since I had it installed for some other testing. You can see I do have the 3090 in there as well, but it is not being used at all. We got 2 tokens per second on the response and 3.2 on the prompt, exactly the same as we got on the 2650 v4. However, I think that would change, so
keep that in mind. If you look at the longer response for Armageddon with a Twist, you do see a pretty big difference versus the 2650 v4. The 3090 was not running at all while we were doing this. It thought for about 7 minutes and then gave its answer, and while that was also a long time, it was not nearly as bad as the 2650, coming in at 1.3 response tokens per second and 6.35 prompt tokens per second. So that is the highest end, and I guess a $5 CPU is the lower end, so we'll call that a good evaluation.

A huge question is what kind of tokens per second I could get with other models, so I threw QwQ on here, and boy, QwQ takes a long time to think through things. It is without a doubt one of the reasoning-est reasoners. That was 1.6 tokens per second and about 4.5 on the prompt. Next, let's check out something much smaller, Gemma 3, and look at that: that is really fast. It is a Q4 model, at 15 response tokens per second and 20 prompt tokens per second. Next we're going to check Kito 14B, and this is a Q8, so it will be quite a bit slower here too. We're definitely seeing that the size of the model has an impact, and of course if you've got a reasoning model, that impacts it a lot. 3.7 and 9.5 on Kito there.
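The pattern in these runs (bigger model, fewer tokens per second) falls out of memory bandwidth. Here's a rough back-of-envelope sketch, assuming decode streams the active weights once per token; the DDR4 channel and speed figures match the hardware discussed in this video, but the per-token weight sizes are my own rough assumptions, not measured values:

```python
# Back-of-envelope for why CPU inference speed tracks memory bandwidth and
# model size. Per-token weight sizes below are rough assumptions, not measurements.

def peak_bandwidth_gbs(mt_per_s: int, channels: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channel, 8 bytes per transfer)."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9

def tokens_per_sec_ceiling(bandwidth_gbs: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed if each token streams the active weights once."""
    return bandwidth_gbs / active_weights_gb

# Quad-channel DDR4-2133, as in the Z440 build:
z440_bw = peak_bandwidth_gbs(2133, 4)   # ~68 GB/s, near the ~75 GB/s spec peak

# Assuming eight channels of DDR4-3200 on the EPYC Rome build:
rome_bw = peak_bandwidth_gbs(3200, 8)   # ~205 GB/s, near the ~200 GB/s cited

# DeepSeek 671B is MoE: assume ~37B active params per token, ~20 GB at a q4-ish quant.
print(round(z440_bw, 1), round(rome_bw, 1), round(tokens_per_sec_ceiling(z440_bw, 20.0), 1))
# -> 68.3 204.8 3.4
```

The observed 2 tokens per second on DeepSeek sits under that ~3.4 ceiling, which is consistent: real runs never hit theoretical peak bandwidth, and prompt processing, KV-cache reads, and NUMA effects all take a cut.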
So next up we're going to ask it a question about some ancient Egyptian trivia factoids. I find this pretty interesting, and it usually gives me a unique result from each of the LLMs I test it on. You can see there's some degradation here, and I would say Kito is not one of the chattiest models out there; it does take its time. But you can also see that we're not burning that many watts while we're doing this, and that's with it thinking. Not bad. We got 3.37 and 12.8.

Next up, let's move to something substantially smarter: Gemma 3 12B Q8. This is kind of where I think most people should baseline, and I wanted to make sure to show it, so keep your eye on the mem readout in the upper right-hand corner. We got 4.1 tokens per second and 7.5 prompt tokens per second. Very, very acceptable. Next we're going to throw Armageddon with a Twist at it. This one I'm going to let play out (sped up, because there is some hilarious stuff in here and it goes by a little too fast even for that), so be sure to visit digitalspaceport.com, where you can read the entire thing. Keep looking at that memory readout: if you had 16 gigabytes and a CPU with good, and I mean really good, bandwidth, you could probably hit something quite similar to this. This is a $5 CPU, after all. The 2650 v4 used to be an $1,100 CPU at MSRP, so times have changed, and prices of course have too. The 2696 v4 is currently about a $140 to $150 CPU, and I believe it is the top end of the Broadwells in the E5-2600 lineup. So if you did have 16 gigs of RAM and, let's say, an E5-2650 v4, you could be getting pretty much this exact same experience without even a GPU. Gemma 3 12B Q8 looks like it's doing well at that: 3.68 response and 15 prompt tokens per second. That's actually pretty usable. So if you're looking
at performance for mid-range models, there is definitely a huge argument to be made for dual 3060s, and there are a lot of other GPUs you could use. There will also be a video out shortly with a 3090 in the Z440 (I had to steal it out of here), because a lot of people have asked me about that, and I'll just say this: it does work. You can check the writeup; all that stuff is linked in the description below. But if you're looking at the performance picture on mid-range models, say Gemma 3 at Q8, that's an actually pretty good model. The 12B is where I'd put the good-experience cutoff, and it can run in a 16 GB footprint, which is pretty awesome. And with 24 GB GPUs, as you saw in the prior video we just released, it does really well. So a $750 rig can do pretty well if you just had 16 GB and some CPU.

Now keep in mind, not all CPUs are created the same. Quad-channel memory was historically focused on enterprise and workstation parts, and the Broadwell lineup actually gave about 75 peak gigabytes per second. That's a lot of throughput, and everything comes down to bandwidth at the end of the day. That is why you're able to see these kinds of performance numbers, and it's the same reason the Rome chip in the original DeepSeek video I did for the $2,000 build was basically about double this performance: it came in at about 200 max theoretical gigabytes per second. I will say that in both of these instances I'm running suboptimal this or that, and that almost always happens, but it's still enough to get the performance we saw here today. And while it's not the 19.6 response tokens per second of the GPU rig versus the 3.35 here, roughly a 6 to 6.5x difference
in performance, so call it about 6.5 times faster, that is pretty significant. But that is also just a CPU, so it's pretty interesting. Now, do keep in mind this may not apply to all CPUs, and I can't tell you why, but there are a lot of modern CPUs that have not had a lot of focus from the manufacturers placed on memory bandwidth. You can find all of these numbers, and the comparison between the $750 rig and the $500 rig, linked over at digitalspaceport.com.

Of course, talking about DeepSeek 671B: in either of these scenarios you're not going to be able to run it on GPUs in something like a Z440, because you don't have enough GPU slots, and the GPUs you could put in there don't have enough VRAM. You may be able to offset it slightly, but you're not going to make a huge dent in the VRAM footprint. However, you can offload certain portions to the GPU with ktransformers, and that can dramatically speed things up. So if you were able to get a couple of GPUs in there, maybe 24 GB cards, you might be able to offload some layers and see good performance if you can utilize that. I'm going to have more write-ups on the more complicated stuff coming down the pipeline.

I know a lot of people have asked me: why Ollama? Ollama is one of the slower implementations out there (they're now working on their own engine, to my understanding), definitely not one of the fastest, but it is the easiest, and as a result very large numbers of people use it. Next, let me ask you, the audience: I know there are probably some Z440 experts out there. Do you know how to get around error 539? If you do, let me know. I have looked all over the BIOS and cannot find it. I have bridged the front USB header so I don't get that warning; maybe there's something else I can jumper to bypass it. If you know, let me know for sure. It looks like the degradation in performance between the top-end CPU and a very cheap $5 CPU is there, but neither
one of these is really a recommendation. I'm not saying you should go out and get 512 gigs of RAM so you can run DeepSeek 671B at a few tokens per second on a "hello", because that's pretty much all you're going to get. The reasons for this are multiple, and that's why this is going to be a bit longer of a conclusion. Things are not going to stay the way they are today. Even a month from now, we're going to have new flavors of new best models in different sizes, and we don't know what those are yet. However, there is one way you can hedge against uncertainty, and that is to have flexible systems. That's why the system I put together originally, the first recommendation, was a quad-GPU rig. At the time, when 3090s were still cheap, it was very affordable for what you were getting, and, as I knew at the time, it also had the capability to run tremendous amounts of system RAM at very high memory bandwidth, about 200 gigabytes per second like I mentioned earlier. The Broadwell series, capping out at about 75 or so gigabytes per second, is substantially lower, but that's still fairly respectable, and Skylake and the like get up there a little faster. Your RAM speed, when you're down this low, may or may not have a serious impact; I don't know, so sound off and let me know on that one. I mean, I'm using 2133 RAM here, and it seemed like it was doing pretty well. The EPYC system did perform faster, by roughly 2x or a little better; that's what we saw when we did the testing on the AMD EPYC. So looking at where you get the best bang for your buck is definitely the thing I would recommend everybody always do. And maybe you are just very well off, and getting the latest and greatest 512 GB M3 Ultra Mac is something you're cool with; hey, that's $10,000, but maybe that works out for you. I still love GPUs for a lot of reasons, but mainly because they
have the greatest number of applications that work with them and the widest range of support. It's very well known, and you can see it in just a few queries, that this is still the way things are trending. It is getting better and better, but it's still heavily favored toward NVIDIA and CUDA right now, especially for things like image generation and video generation.

Some important topics we did not cover in this video on CPU inference: Gemma 3 can actually look at and analyze photos in a local environment, but I'm not sure how well that would perform. I should have tested it; maybe I'll test it in the 3090 video, kick off the GPU and just give it a shot. But definitely, if you're looking at 128 GB in some sort of home server, well, first off let me say that 64 GB to 128 GB is badass. You are going to have a really great experience, because you're not going to be running out of RAM, and that's almost always the reason you hit limitations. I will say this: a lot of people recommend quad-core, and you don't see this channel recommending quad-core anything, for good reason; 8 cores is pretty much where you should be today. So if you're looking at something like a 22-core, 44-thread CPU, which is insane, you might not need to go quite that crazy. Of course, that's a single-socket system we're talking about. Keep in mind that with Broadwell you can actually use dual-socket machines too, and the same goes for Skylake, AMD EPYC, and most of the other server stuff. That gives you an amazing capability, but, big but, it also gives you an interconnect limiter. I now suspect what actually happened in the first video: the numbers on paper were what I had expected to hit when I was checking out the R930s, and I don't think I was able to hit them because of QPI. I did jam a lot of RAM in there, and they have four sockets in each one of the R930s. Those are
Broadwell v4 chips as well, the top-end 8890 v4s, pretty banger, but that QPI link is where everything slows down, because it does not operate at the full RAM bandwidth the system is otherwise capable of. So I now have something else to test. I don't know if I can even pull CPUs out and run those machines without all the sockets populated, but I might give that a shot and see whether it's possible.

We saw about 180 to 190 watts as the top peak during CPU-only inference on the 2696 v4; on the 2650 v4 I saw about 160. That's a bit high, but not crazy high, and both of those idle down really well. If you're looking at idle states, they drop to about 55 to 60 watts. That's not bad given the number of active PCIe lanes, all the USB, the internals, and the system fan, which is some OEM thing that I'm sure is not the most efficient in the world. If you've got 32 gigs or 64 gigs or 128 gigs, you would actually be pretty happy with some of the CPU inference you could offload to your CPUs. Certain models might run just fine there, and that gives you more room on your GPUs to balance things out, which is thinking about it intelligently. So I would consider 32 GB as your minimum and try to get to 64; I think you'll be really happy, and if you can run another ancillary model at the same time, it certainly increases your capabilities.

The other thing I would say is about the pain, or lack of it, in building these systems. Workstations like the HP Z440 are very easy. A lot of people ask me about the Z640, the Z840, and some of the newer G-series; all good, and all of those are also linked in the description below if you're looking for more information, and you can check out digitalspaceport.com. I've got some stuff that's just been laying around that I've checked out, and I've got to say, the ability to actually get two
GPUs and 24 GB of VRAM, and even a single 3090, working perfectly fine, even at max power, in a Z440 has been pretty cool. LRDIMMs are not completely officially supported, hence the weirdness we saw, but from what I've read, the Z640 and Z840 don't appear to have the same problems. Let me know if you're a specialist; I know a lot of people have deep HP knowledge. I've mainly been a Dell person myself, and LRDIMMs usually don't present an issue in most of the Dell systems I've dealt with, though most of those are rack servers or top-end tower servers like a T620.

So my biggest conclusion for you: at the end of the day, having a budget that allows you to buy something you can grow into is not a bad idea, because it lets you add on new capabilities, a piece here or there, as you decide to or as you see the demand. With workstations you can only do that so much, but you can do it somewhat. And today I showed you an option that, for a lot of the small to mid-range models, especially that Gemma 3 12B, I think you really should consider. It's got to be the Q8, though; the Q4 is okay, but the Q8 is primo for the size. This is really something that can run anywhere, and I think that's important to keep in mind. We're going to see it more and more, and I think this omnipresence of systems that can effectively run AI is great. Dedicated 128 GB systems like the NVIDIA DIGITS and some of the new AMD stuff that's out look good, but can you expand those systems? That's something I would factor in quite heavily myself, especially for things that are incredibly expensive but really nice, like the 128 GB Mac M4s. Whether you can add on a GPU or extra VRAM is kind of a good question
sometimes you would like to do certain ancillary tasks like video or image, so that does have implications for people on Macs as well. So these are my thoughts, and I look forward to reading yours down below. Be sure to hit like and subscribe, and thank you to all of our channel members and subscribers. You can join down below if you're interested in supporting this channel, and you can also buy me a coffee or join me on Patreon. Everybody have a great rest of your day; I will check you out next time.
2025-04-17 06:17