Deepseek R1 671b on a $500 AI PC!


Running the full DeepSeek 671b on a $500 machine is possible, and I'm actually a little surprised at the tokens per second we're able to get with it. And this is on a $5 processor. That's right, you heard me: a $5 processor is what enables this, along with 512 GB of RAM and a very cheap HP Z440. All of that is about $500 total. I'm sure you've got a lot of questions, so let's put this system together and start answering them. You can use the chapters below at any time to skip around any part of this video, including the conclusion, if you're interested in getting straight to the takeaways (this is going to be a pretty long conclusion, as well). All of these stats are recorded at digitalspaceport.com, and the link to that is in the description below.

It's really simple. All we're going to do is pop in the RAM, which is 64 GB LRDIMMs, undocumented but functional in the HP Z440, though not necessarily functional the way you would hope. Also make sure that you have good, tight contact on your CPU, whichever CPU you want to go with; we are going to check out the 2696 v4 as well. The shroud does keep those RAM sticks cool and prevents extra warning messages, but there will be this one warning message: just hit Enter to skip it. I cannot find a way to move past the 539 warning screen on the HP Z440, but it does boot up fine and work fine, and all of the RAM is present.

This is where I have our DeepSeek model stored, and that is quite a large file, so do keep that in mind if you're thinking of this: it's about 400-some gigabytes right there. You can see that we've got our E5-2650 v4 processor and, of course, the RAM. On the right-hand side you can see htop running, and on the bottom, btop. Kicking it off here with just a "howdy" to get an idea of the tokens per second, it looks like we're getting 2 response tokens per second, 2.9 prompt tokens per second, and 22 total tokens.

Next we'll run a much larger prompt and see what we get back as far as inference speed. This will be much slower, so I am definitely going to do some cutting here. Just keep in mind that if you want to check out all of these runs, you can go to digitalspaceport.com and look for the $500 CPU inference article that I'm going to put together. You can see that came down to 0.75, so a lot of degradation: 3.26 prompt tokens per second and 1,279 total tokens. That was not fast at all.

I also wanted to check this out on the 2696 v4, since I had that installed for some other testing. You can see that I do have the 3090 in there also, but it is not being used at all. We got 2 tokens per second and 3.2 on the prompt, exactly the same as we got on the 2650 v4. However, I think that would change, so keep that in mind. On the longer "Armageddon with a Twist" response, you do see a pretty big difference versus the 2650 v4, with the 3090 not running at all while we were doing this. It thought for about 7 minutes and then gave its answer, and this was also a long time, but not nearly as bad as the 2650: 1.3 response tokens per second and 6.35 prompt tokens per second. So that is the highest end, and a $5 CPU, I guess, is the lower end. We'll call that a good evaluation.

A huge question is what kind of tokens per second I could get with other models, so I threw QwQ on here, and boy, QwQ takes a long time to think through things. Without a doubt it is one of the real reasoners: that was 1.6 tokens per second and about 4.5 on the prompt. Next, let's check out something much smaller, Gemma 3, and look at that, that is really fast. It is a q4 model: 15 response tokens per second and 20 prompt tokens per second. Next we're going to check Kito 14b, and this is a q8, so it will be quite a bit slower here too. We're definitely seeing that the size of the model has an impact, and of course a reasoning model impacts it a lot: 3.7 and 9.5 on Kito there.
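Those decode speeds track a simple back-of-the-envelope model: CPU inference is memory-bandwidth-bound, so tokens per second is roughly the bandwidth you actually achieve divided by the bytes of weights streamed per token. Here is a minimal sketch; the 60% efficiency factor and the per-model numbers are my assumptions, not measurements from the video:

```python
# Back-of-the-envelope, memory-bound estimate of CPU decode speed.
# Each generated token streams the active weights out of RAM once,
# so tokens/sec is roughly achieved bandwidth / bytes read per token.

def est_tokens_per_sec(active_params_b: float, bytes_per_param: float,
                       peak_bw_gbps: float, efficiency: float = 0.6) -> float:
    """active_params_b: billions of params touched per token (for an MoE
    model like DeepSeek R1 that's the ~37B active params, not all 671B).
    bytes_per_param: ~0.5 for a q4 quant, ~1.0 for q8.
    efficiency: assumed fraction of peak bandwidth actually achieved."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return peak_bw_gbps * 1e9 * efficiency / bytes_per_token

# Broadwell quad-channel DDR4 peaks at about 75 GB/s:
print(f"{est_tokens_per_sec(37, 0.5, 75):.1f}")   # MoE q4, a couple tok/s
print(f"{est_tokens_per_sec(12, 1.0, 75):.1f}")   # dense 12b q8, a few tok/s
```

The key point for DeepSeek R1 is that only the roughly 37B active MoE parameters are read per token, which is why a 671b model can decode at a couple of tokens per second at all on quad-channel Broadwell bandwidth.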

So next up, we're going to ask it a question about some ancient Egyptian trivia factoids. I find this pretty interesting, and it usually gives me a unique result from each of the LLMs I test it on. You can see that it's got some degradation here, and I would say Kito is not one of the chattiest models out there; it does take its time. But you can also see that we're not burning that many watts while it's thinking, which is not bad, and we got 3.37 and 12.8.

Next up, let's move to something substantially smarter: Gemma 3 12b q8. This is kind of where I think most people should baseline, and I wanted to make sure to show this, so keep your eye on the mem readout in the upper right-hand side. We got 4.1 tokens per second and 7.5 prompt tokens per second. Very, very acceptable. Next we're going to throw "Armageddon with a Twist" at it. This one I'm going to let play out (we're going to speed it up, but I'm going to let it play out) because there is some hilarious stuff in here, and it goes by a little too fast even for this, so be sure to visit digitalspaceport.com, where you can read the entire thing. Keep looking at that memory readout in the upper right-hand side. If you had 16 gigabytes and a CPU with good, and I mean really good, bandwidth, you would probably be able to hit something quite similar to this. This is a $5 CPU, after all; the 2650 v4 used to be an $1,100 MSRP CPU, so times have changed and prices have of course changed with them. The 2696 v4, however, is about a $140 to $150 CPU currently, and I believe that is the top end of the Broadwells for the E5-2600 lineup. So if you did have 16 gigs of RAM and, let's say, an E5-2650 v4, you could be getting pretty much this exact same experience without even a GPU. Gemma 3 12b q8 looks like it's doing well at that: 3.68 and 15 prompt tokens per second. That's actually pretty usable.

Performance-wise for mid-range models, there is definitely a huge argument to be made for dual 3060s, and there are a lot of other GPUs you could use. There will be a video out shortly with a 3090 (I had to steal it out of this machine) in the Z440, because a lot of people actually asked me about that, and it does work. I'll just say this: it does work. You can check the writeup; all that stuff is linked in the description below. But if you're looking at the performance cutoff on mid-range models, Gemma 3 12b at q8 is an actually pretty good model, and I think that's where I would put the good-experience cutoff that can run in a 16 GB footprint. Pretty awesome. And with 24 GB GPUs, as you saw in the prior video we just released, it does really well, so a $750 rig can do pretty well if you just had 16 GB and some CPU.

Now keep in mind, not all CPUs are created the same. Quad-channel memory was more focused towards enterprise and workstation in the past, and the Broadwell lineup actually gave about 75 peak gigabytes per second. That's a lot of throughput and bandwidth, and everything comes down to that at the end of the day; that is why you're able to see these kinds of performance numbers. It's the same reason the Rome chip in the original DeepSeek video I did for the $2,000 build was basically about double this performance: it came in at about 200 max theoretical gigabytes per second. Now, I will say that in both of these instances I'm running suboptimal this or that, and that almost always happens. Still, it's enough to get the performance we saw here today. And while it's not the 19.6 response tokens per second and 335 of that build, that works out to about a 6 to 6.5 times difference

in performance, so it's about 6.5 times faster. That is pretty significant, but that is also just a CPU, so that's pretty interesting. Now, do keep in mind this may not apply to all CPUs, and I can't tell you why, but there are a lot of modern CPUs that have not had much focus from the manufacturers placed on memory bandwidth. You can find all of these numbers, and the comparison between the $750 rig and the $500 rig, linked over at digitalspaceport.com.

Of course, talking about DeepSeek 671b: in either of these scenarios you're not going to be able to run it on GPUs in something like a Z440, because you don't have enough GPU slots, and the GPUs you could put in there don't have enough VRAM. You may be able to minorly offset it, but you're not going to make a huge impact on the VRAM footprint. However, you can offload certain portions to a GPU with ktransformers, and that can dramatically speed things up. So maybe if you were able to get a couple of GPUs in there, 3060s maybe, or 24 GB cards, you might see some layers offloaded with good performance, if you're able to utilize that. I'm going to have some more write-ups on more complicated things coming down the pipeline. I know a lot of people have asked me: why Ollama? Ollama is one of the slower implementations out there (they're now working on their own engine, to my understanding), but while it is definitely not one of the fastest implementations, it is the easiest, and as a result very large numbers of people use it.

Next, let me ask you, the audience: I know there are probably some Z440 experts out there, so do you know how to get around error 539? If you do, let me know. I have looked all over the BIOS and I cannot find it. I have bridged the front USB header so I don't get that warning; maybe there's something I can jumper to bypass it. If you know, let me know for sure.

The degradation in performance between the top-end CPU and a very cheap $5 CPU is there, but neither one of these is really a recommendation. I'm not saying you should go out and get 512 gigs of RAM so you can run DeepSeek 671b at a few tokens per second on a "hello"; that's pretty much all you're going to get. The reasons for this are multiple, and this is why this is going to be a little bit longer of a conclusion. Things are not going to stay the way they are today. Even a month from now we're going to have new flavors of new best models, and they're going to have different sizes we don't know yet. However, there is one way to hedge against uncertainty, and that is to have flexible systems. That's why the first recommendation in the system I put together originally was a quad-GPU rig: at the time, when 3090s were still cheap, it was actually very affordable for what you were getting, and I knew at the time that it also had the capability to run tremendous amounts of system RAM at very high memory bandwidth, about 200 gigabytes per second like I mentioned earlier. The Broadwell series, capping out at about 75 or so gigabytes per second, is substantially lower, but that is still fairly respectable. Of course, Skylake and the like get up there a little bit faster, and your RAM speed, when you're down this low, may or may not have a serious impact. I don't know; you sound off and let me know on that one. I mean, I'm using 2133 RAM here, 2133, and it seemed like it was doing pretty well. The Epyc system did perform faster, by roughly two times or maybe a little better; that's what we saw when we did the testing on the AMD Epyc.

Getting the best bang for your buck is definitely the thing I would recommend everybody always do. Maybe you are just very well off, and getting the latest and greatest 512 GB Mac Ultra M3 is something you're cool with; hey, that is $10,000, but maybe it works out for you. I still love GPUs for a lot of reasons, but mainly because they have the greatest number of applications that work with them and the widest range of support. You can see in just a few queries that that is still the way things are trending. It is getting better and better, but it is still heavily favored towards Nvidia and CUDA right now, especially for things like image and video generation.

Some important topics we did not cover here today on CPU inference: while Gemma 3 does actually support looking at and analyzing photos in a local environment, I'm not sure whether it would perform well. I should have tested that; maybe I'll test it in the 3090 video, kick off the GPU and just give it a shot. But definitely, if you're looking at 128 GB in some sort of home server, well, first off let me say 64 GB to 128 GB is badass. You are going to have a really great experience, because you're not going to be running out of RAM, and that's almost always the reason you run into limitations. I will say this: a lot of people recommend quad-core, and you don't see this channel recommending quad-core anything, for good reason. 8-core is pretty much where you should be today. Something like a 22-core, 44-thread CPU is insane; you might not need to go quite that crazy. Of course, that's a single-socket system we're talking about. Keep in mind that with Broadwell you can actually use dual-socket machines too, as you can with Skylake, AMD Epyc, and most of the other server stuff, and that gives you an amazing capability. But, big but, that also gives you an interconnect limiter. I now suspect what actually happened with the R930s in the first video: those were the numbers on paper I had expected to hit, and I don't think I was able to hit them because of QPI. I did jam a lot of RAM in there, and there are four sockets in each of the R930s. Those are Broadwell v4 chips also, the top-end 8890 v4s, pretty banger, but that QPI link is where everything slows down, because it does not operate at the full RAM bandwidth of the system. So I now have something else to test. I don't know if I can even pull CPUs and operate those machines without a full set, but I might give that a check and see whether it's possible.

We saw about 180 to 190 watts as the top peak in CPU-only inference; that was on the 2696 v4. On the 2650 v4 I saw about 160. That's a bit high, but not crazy high, and both of those idle down really well: idle states drop to about 55 to 60 watts. That's not bad given the number of active PCIe lanes, all the USB, the internals, and the system fan. I don't know what fan they have installed in there, but it's kind of an OEM thing, and I'm sure it's not the most efficient thing in the world.

If you've got 32, 64, or 128 gigs, you would actually be pretty happy with some of the CPU inference you could offload to your CPUs. Certain models might run just fine there, and that gives you more room on your GPUs, so you're able to balance things out. That is intelligently thinking about things. So definitely, I would treat 32 GB as your minimum and try to get to 64; I think you'll be really happy, and if you can run another ancillary model at the same time, it certainly will increase your capabilities.

The other thing I would say, when you're looking at how much of a pain systems are to build: workstations like the HP Z440 are very easy. A lot of people ask me about the Z640, the Z840, and some of the newer G-series machines. All good, and all of those are also linked in the description below if you're looking for more information; you can check out digitalspaceport.com. I've got some stuff that's just been laying around that I've checked out, and I've got to say, the ability to actually get two GPUs and 24 GB of VRAM, and even a single 3090 working perfectly fine, even at max, in a Z440, has been pretty cool. LRDIMMs are not completely officially supported, hence the weirdness we saw there. However, I did read that the Z640 and Z840 do not appear to have the same problems, so let me know if you're a specialist; I know a lot of people have a lot of knowledge around HP. I've mainly been a Dell person myself, and usually LRDIMMs don't present an issue in most of the Dell systems I've dealt with, but most of those are also rack servers or top-end tower servers like a T620.

So my biggest conclusion for you: at the end of the day, having a budget that lets you buy something you can grow into is not a bad idea, because it allows you to add new capabilities in the future. Starting with one thing and adding a piece here or there, as you decide or as you see the demand, is not a bad idea. With workstations you can only do that so much, but you can do it somewhat. Today I showed you an option that I think is worth considering for a lot of the smaller mid-range models, especially that Gemma 3 12b. It's got to be the q8; the q4 is fine, but the q8 is primo for the size. This is really something that can run anywhere, and I think that's important to keep in mind. We're going to see it more and more, and I think this omnipresence of systems that can effectively run AI is great. Dedicated 128 GB systems like the Nvidia Digits and some of the new AMD stuff that's out look good, but can you expand those systems? That's something I would factor in quite heavily myself, especially for things that are incredibly expensive but really nice, like 128 GB Mac M4 systems. Can you add on a GPU or extra VRAM? That's a good question, because sometimes you would like to for certain ancillary tasks like video or image work. So yeah, that does have implications for people on Macs also.

These are my thoughts, and I look forward to reading yours down below. Everybody have a great rest of your day. Be sure to hit like and subscribe, and thank you to all of our channel members and subscribers. You can join down below if you're interested in supporting this channel, or buy me a coffee, or join me on Patreon. Everybody have a great rest of your day. I will check you out next time.
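As a footnote to the RAM-budget advice in the conclusion, here is a rough way to estimate whether a model fits a given amount of memory: parameters times bytes per weight, plus some overhead for KV cache and runtime. The 20% overhead figure and the model list are illustrative assumptions on my part, not numbers from the video:

```python
# Rough RAM footprint: params x bytes-per-weight, plus ~20% overhead
# for KV cache and runtime (assumed figure, not a measurement).
BYTES_PER_WEIGHT = {"q4": 0.5, "q8": 1.0, "f16": 2.0}

def footprint_gb(params_b: float, quant: str, overhead: float = 1.2) -> float:
    """params_b is billions of parameters; quant picks bytes per weight."""
    return params_b * BYTES_PER_WEIGHT[quant] * overhead

# Illustrative models similar to the ones tested above:
for name, params, quant in [("gemma3-12b", 12, "q8"),
                            ("qwq-32b", 32, "q4"),
                            ("deepseek-r1-671b", 671, "q4")]:
    gb = footprint_gb(params, quant)
    print(f"{name}: ~{gb:.0f} GB, fits in 16 GB: {gb <= 16}")
```

By this estimate, a 12b q8 model squeaks into a 16 GB footprint while the 671b q4 lands at around 400 GB, which is consistent with the 400-some-gigabyte model file and the 512 GB of RAM in this build.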

2025-04-17 06:17
