This is how DeepSeek runs locally at full quality

Look here, I'm running the complete, non-reduced DeepSeek R1 AI locally. Oh yeah, the thing is currently using 709 GB of RAM. How I do that is coming up in a moment. Why am I doing this? Because you are obviously very interested in the topic of DeepSeek, and especially in running DeepSeek locally. Many of you don't really trust the app or the Chinese DeepSeek servers. And yes, it seems that mistrust is not entirely unfounded. But the great thing is that the makers of DeepSeek have put the language model's weights openly online. So you can run the thing locally. So just start it on your laptop and you're done? Well, no. If you

want to run DeepSeek at the quality it delivers online on chat.deepseek.com or in the Android and iOS apps, you need, for example, 8 Nvidia H200 professional GPUs. Such a system costs at least $300,000. Well, unfortunately that's not in my c't 3003 budget. But I was still very interested in whether it is really possible to run the really good DeepSeek R1 and V3 models locally - that is, whether you really get the same quality as online on deepseek.com. I don't want to accuse anyone of fraud or anything like that, but I wanted to check whether that's really the case - whether the freely available DeepSeek weights really deliver exactly what DeepSeek promises in the cloud. That's exactly what you'll see in this video. And I'll also explain how you can get the thing running at full quality for significantly less than $300,000. Or at least close to it.

Here I'll also show you clearly why claims like “This is how you run DeepSeek on your Raspberry Pi” are at least misleading. And, very important: in this video you'll also see whether the local versions are less censored than DeepSeek online. Stay tuned. Dear hackers, dear internet surfers, welcome to... This video is sponsored by Intel Core Ultra. Support for Windows 10 ends on October 14, 2025. Companies that continue to use it after support ends run security and compliance risks, because there are no longer any security updates.

AI PCs with Intel Core are ideal for upgrading to Windows 11 Pro. The notebooks are equipped for current business requirements and are available from Bechtle. The integrated NPU assigns workloads to the appropriate thread and core using real-time analysis, making Intel's AI PCs extremely energy-efficient. The vPro platform also offers extensive security and device-management functions. That's why you should take the step to Windows 11 Pro now - with Intel and Bechtle. Find out more on the Bechtle website. Link to it is in

the description. Advertisement end. Yes, again very briefly: the AI chatbot DeepSeek has made a real splash in the last few weeks and sent share prices crashing. We already made a video about that. What's most relevant to me: with DeepSeek, for the first time, there are open language models that can seriously compete with ChatGPT from OpenAI or Claude from Anthropic. There have always been decent open language models, for example Meta's LLaMA family, which were okay, but in my opinion couldn't really keep up with the commercial competition. DeepSeek now can, for the first time. And it really is a huge difference whether you throw your prompts into a black box in the cloud - i.e. on chatgpt.com or into all the apps - where no one really knows what happens to them on servers in the USA or China or wherever.

Or whether everything really happens completely locally on my own hardware, maybe even without any external internet access. Of course, this is especially relevant if you have a company and company data that needs protection. An internal LLM for company matters can be quite practical.

DeepSeek has two competent models: the normal one, V3, and the reasoning model R1, which thinks before giving an answer. R1 takes longer to answer, but is notably better at mathematical, technical and programming questions. The thing is: both of DeepSeek's models need immensely beefy hardware. For example, 8 × H200 GPUs are recommended. That costs at least $300,000. And this isn't hardware that you just put under your desk, because it's super loud, gets extremely hot, needs air conditioning and so on and so forth. Unfortunately, this

also applies to most server hardware. If you don't have something like that, there are two alternatives: Either you live with lower intelligence and run a dumbed-down model on smaller hardware. Or you can take the unchanged model and run it on comparatively cheap CPUs instead of expensive GPUs - i.e. CPUs and normal RAM instead of GPUs and VRAM.

The problem, however, is that LLMs run significantly better on GPUs, simply because GPU memory has a much higher bandwidth than normal RAM. The consumer graphics card with the largest memory at the moment is the Nvidia RTX 5090, but it only has a measly 32 GB. You can't really buy it right now, and it costs over 2000 euros. DeepSeek needs 720 GB, so you would need 23 of these graphics cards. And let me just say: I can't fit 23 graphics cards on my motherboard. So the only alternative is the CPU. But we

just need at least around 720 GB of RAM. Normal office or gaming PCs only have four RAM slots, and RAM modules max out at 64 GB each. That works out to 256 GB for consumer desktop PCs. So you need a workstation or a server motherboard. Server CPUs can address more RAM, and the boards also have more slots. Here I clicked together a configuration with 768 GB; that comes to 5870 euros. A lot of money, of course, but significantly cheaper than GPUs. Or you can look at used gear: here is an old workstation with 768 GB of RAM for 2700 euros.
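As a sanity check, the sizes mentioned here can be reproduced with some back-of-envelope shell arithmetic (the 671-billion-parameter count and the 720 GB figure are from the video; the rest is plain math):

```shell
PARAMS_B=671                                        # parameters in billions (from the video)
echo "FP16 weights: $(( PARAMS_B * 16 / 8 )) GB"    # 2 bytes per parameter
echo "FP8/Q8 weights: $(( PARAMS_B * 8 / 8 )) GB"   # 1 byte per parameter, plus some overhead on disk
echo "32 GB GPUs for ~720 GB: $(( (720 + 31) / 32 ))"  # ceiling division
echo "Consumer board ceiling: $(( 4 * 64 )) GB"        # 4 slots x 64 GB modules
```

So 8-bit weights alone land at roughly 671 GB, which is why 768 GB of RAM is the sensible target for the full model.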

Yes, and I thought about getting something like that for this video to do my DeepSeek test. So I'm thinking about it, and then my colleague Christoph Windeck - who, by the way, does a good podcast about CPUs, linked in the description below - says: "Hey, just take the AMD Epyc server that we have here. We only need it occasionally to test new server CPUs. It has 1.5 TB of RAM." And I'm like: "What?" Okay, well, I've been working at c't for many years,

but I didn't know that something like this existed. Well, I quickly had the root password and got started. Here you can see the inxi output: two AMD Epyc 9754 CPUs with 128 cores each and, as I said, 1.5 TB of DDR5 memory at 4800 megatransfers per second. I used

Ubuntu Server as the operating system and ran the LLMs with Ollama, which has become something of a standard for running LLMs if you're not a professional. And because Ollama only runs on the command line, I also installed the fancy Open WebUI web interface. This way you can make a local LLM available to several people in your own network, including people who are not as nerdy as we are. Because Open WebUI is just as easy

to use as ChatGPT or chat.deepseek.com. I'll show you how to install Ollama and Open WebUI at the end of the video. But now the big question: how well does DeepSeek work locally? Is it as good as in the cloud? First, load it into Ollama: go to ollama.com and browse the available LLMs in the library. Yes, this is DeepSeek R1; in the drop-down menu: 671 billion parameters. Looks right

. That's how many parameters the model has. But huh, 404 gigabytes? Isn't that a bit small? Ah, okay, it says Q4_K_M. That means quantized to 4 bits, although DeepSeek actually works with 8 bits natively. Which, by the way, is already very economical: LLM weights are normally distributed as FP16, i.e. floating-point numbers with 16 bits. The DeepSeek weights are FP8, i.e. 8-bit floating-point numbers - much more economical. The thing is, Ollama doesn't support FP8 floating-point weights. That's why I have to

use the Q8 variant, which, to put it simply, is a conversion of the floating-point numbers. I could now go into the difference between INT8 and Q8, but I think that would turn into a bit too much of a computer science lecture. What's important: the Q8 data I use for the R1 and V3 DeepSeek models is the same size as the original FP8 data, just converted. If you now say: "Huh, why didn't you use the original weights? DeepSeek lists what software you can use" - yes, there are seven options. In my tests, the local answers matched the cloud version, sometimes even in the wording. It also answered a number of tricky questions correctly that even some commercial LLMs like Google's Gemini 2.0 Flash can't answer. For example: "Write a grammatically

correct sentence without using any letter more than once." Gemini 2.0 Flash just throws out sentences that use letters several times, and after being asked three times it simply gives up. DeepSeek, on the other hand: great, no repeated letters. This mathematical question is also interesting: "A two-digit number has a digit sum of 12. If the digits are swapped, the number becomes 1.75 times larger. Which number has this property?" DeepSeek R1 correctly throws

out 48. And now, for comparison, the DeepSeek R1 version with only 32 billion parameters. It does the math somehow and then says: there is no number that meets the specified conditions. By the way, this is actually not a DeepSeek model at all, but a Qwen model trained on DeepSeek output with 4-bit quantization, a so-called distilled model. The advantage: it only needs about 20

gigabytes of memory, which means it also runs on large consumer graphics cards. In our last DeepSeek video, some of you recommended it in the comments. But it's clear: the 32-billion-parameter version is okay, but it's no comparison to the real DeepSeek with 671 billion parameters. Take this question: "There are three people in a room. Sahin is reading a book, Lukas is playing chess. What is Pascal, the third person in the room, doing?" The large R1 model that I run locally answers correctly: "Pascal is playing chess with Lukas, because chess takes two people." The 32-billion-parameter version,

on the other hand, says bluntly: "You can't find out what Pascal is doing." Okay, you really do notice a difference in intelligence. And it's not just logic puzzles that the big models are much better at; the same goes for creative writing. For example, when I asked for jokes about the c't editorial team, DeepSeek R1 gave me the following. Attention, attention, this is about to be gendered - don't freak out and call the fire department straight away or something. I read it the way DeepSeek gave it to me: "How many c't employees do you need to repair a coffee machine? One to fix the error and three to log in the log that the comma in the error message was placed incorrectly." That's really funny, especially because it's pretty true. DeepSeek R1-32b also does interesting things, like somehow outputting the jokes in Python, but honestly that only half makes sense: "How does c't test IoT security?

You SSH into the coffee machine, only to find that it responds in hex code?" Okay, huh? It gets really funny, though unintentionally, when you take the tiny model with 1.5 billion parameters, which uses only a little more than a gigabyte of memory, i.e. around 700 times less than the real one. These are the things that run on Raspberry Pis and the like. So just think about what kind of quality that gets you. I read the joke unchanged: "The dramatic chess walter has described himself not only as a sprackengal, but also as an exciting, exciting gang in Mathias player or break infantry. It is like spiritual neck mamour that surrounds the spiritual break of Mennowelt." It invented its own language straight away. And if someone now says: "Yes, the little models just don't speak German!" - well, even in English what they say doesn't really make sense. It's nonsense there too.
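If you want to reproduce this gibberish yourself, here is a sketch of running the tiny distilled variant with Ollama; the model tag is an assumption on my part, so check the library on ollama.com for the current name:

```shell
# Hypothetical tag for the 1.5B distilled model; verify on ollama.com/library
ollama pull deepseek-r1:1.5b
ollama run deepseek-r1:1.5b "Tell me a joke about the c't editorial team"
```

This needs a running Ollama daemon; on a machine with only a few gigabytes of RAM this is about the only DeepSeek-branded model that will load at all.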

The 32B version? Well, in English at least it dropped a pretty nice c't joke: "What do you call it when c't editors argue? A syntax error in communication." Still not as good as the 671-billion-parameter version, which really operates at ChatGPT level. But what about the censorship that everyone keeps talking about with DeepSeek? On the internet, DeepSeek constantly breaks off the conversation if it goes in a direction that is problematic from a Chinese perspective - often with completely harmless things, but also, for example, with the violent suppression of the protests on Tian'anmen Square on June 3rd and 4th, 1989.

Yes, and the local version? It doesn't break off. You can talk to it about the topic - with a clearly recognizable Chinese perspective, but that's to be expected, it is a Chinese model. The blunt censorship, i.e. the abrupt termination that often happens in the cloud version even when you enter completely harmless things, simply doesn't exist locally. DeepSeek tells me clearly here that the Tiananmen crackdown is classified as problematic internationally, but as a necessary measure in China. I actually find it quite interesting that you can see specifically how things are viewed from a Chinese perspective. Because most other

models have such a clear US focus, and that's a relatively narrow view when you look at the whole world. But of course you should know what kind of LLM you are currently interacting with and where its focus lies - media literacy and all that. Now, finally, back to the technical side: how do the two large DeepSeek models actually run on our CPU-only server? Not so fast, I would say. I measured three to four tokens per second on average on the server with the two AMD Epyc 9754 CPUs.
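To put three to four tokens per second into perspective, a quick back-of-envelope calculation (the 500-token answer length is my assumption for a medium-length reply):

```shell
TOKENS=500                               # assumed length of a medium answer
# at 3.5 tokens/s: multiply by 10/35 to stay in integer arithmetic
echo "$(( TOKENS * 10 / 35 )) seconds"   # well over two minutes of waiting
```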

Can you work with that? Yes, but it's not exactly great. Especially because I had problems keeping the model stable in memory during my experiments - and that despite having already instructed Ollama to keep it in RAM indefinitely.
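For reference, the knob for this in Ollama is the documented `OLLAMA_KEEP_ALIVE` environment variable, where -1 means "never unload". A sketch of setting it via a systemd override (this is a config fragment, not something to run blindly):

```shell
# Opens an override file for the Ollama service; add under [Service]:
#   Environment="OLLAMA_KEEP_ALIVE=-1"
sudo systemctl edit ollama.service
sudo systemctl restart ollama
```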

That works fine with smaller models. With the big chunks here, though, there were repeated problems with Ollama getting stuck and not producing any more output. I then always had to stop Ollama manually with "systemctl stop ollama" and start it again. And the thing is: just loading the model into memory takes around 20 minutes each time. Here you can see with htop how the server slowly fills up its memory; output only starts once everything is loaded. In practice, for the R1 and V3 models, htop showed me

711 gigabytes each. And that fits, because the weights on the SSD are that big: "ollama list" reports 713 gigabytes each. So you should just about get there with 768 gigabytes of RAM. And then there are the 4-bit quantized versions that make do with 404 gigabytes. How much worse are those than the 8-bit version? Well, I have to say,

that is very difficult, pun intended, to quantify. The 4-bit version delivers very good results in any case, but it's hard to say it is some specific percentage worse. My conclusion: I think I've spent this much tinkering time on very few c't 3003 videos - you may have noticed that. There is so much to tell, and the topic is quite complex. But in any case, my original question is

at least partially answered: yes, DeepSeek obviously really delivers with the open weights the same thing you see on chat.deepseek.com or in the mobile app. And that's a big deal, because DeepSeek R1 and V3 are the first AI models that I can touch locally, so to speak, that can really keep up with the big top dogs from OpenAI, Anthropic and Google. Nothing against Llama and co., they are usable too, but always a little worse than ChatGPT. DeepSeek is on par - sometimes worse, sometimes a little better. And good LLMs are simply practical for a lot of things, as you can see here. Even if not everyone in our editorial team agrees with me: I think there are a lot of people who want to run DeepSeek locally.

For one thing, it's not as censored as online; but above all, you know that your own data stays on your own hardware. Exactly, and this question comes up often: "Is it really safe to run a local LLM from a source you don't know very well?" Here you first have to consider that an LLM is not an executable program that could directly contain malicious code; it is data, i.e. weights, which first has to be loaded and interpreted by software like Ollama. Sure, the LLM could try to manipulate that software somehow or exploit loopholes, but that's pretty academic, because the LLM doesn't even know, for example, who is currently running it. I've spoken to a few people who know a lot about security,

and they say the risk is pretty small. So feel free to play around with the DeepSeek weights; the risk is manageable. But one thing is also very clear: running 671-billion-parameter models on the CPU isn't really fun. I wouldn't recommend setting up a server for several thousand euros, even used, because it's just too fiddly and, above all, too slow. There's really a lot of brainpower

going into the whole LLM story at the moment, so let's see what happens in the future. I can well imagine that there will soon be cheaper and hopefully faster options for running something like this locally. And of course the smaller models, for example the Qwen DeepSeek distillation with 32 billion parameters, definitely have their place - for example in applications like RAG. That stands for Retrieval-Augmented Generation and means that the language model doesn't answer questions primarily from its own memory, but from external data, for example from PDF files. You simply don't need huge language models for something like that.

Yes, you can see: huge topic. Feel free to write in the comments what you think about the whole thing, and of course subscribe to c't 3003 - you'll get a fittingly LLM-style answer from me. But first I wanted to point out the thing from the intro: this is the Make-OxoCard special. There's

a little set that you can use for electronics tinkering. The special thing is that - unlike Arduino or ESP32 - the small controller here has a display, which means you can control it directly with this small control pad. The projects are included and are also suitable for beginners. You can order it; link is in the description. And now: installing Ollama under Linux. You can easily

install Ollama on Windows and macOS via the website; on Linux you install it with this command here. By the way, all the commands are in one document, which I'll also link in the description. Then Ollama runs, but only on the command line. If you also want the nice Open WebUI web interface, you can get it with these two Docker commands - I ran it in Docker, which is the recommended variant for Open WebUI. Then

you have to briefly edit the Ollama configuration with this command here and add this line under "[Service]". You need this so that Open WebUI can see Ollama - so that it can access the models installed by Ollama. Then Open WebUI should see the Ollama models. If you don't have one installed yet, you can do that directly in Open WebUI with the mouse, or just use the command line via "ollama run". You can find the list of models on ollama.com. So, and finally

a wonderful output from the 1.5-billion-parameter model, i.e. the tiny model, in answer to the question: "Who am I?" (Jan-Keno) Janssen is not a pubinized, Norbornian ordeal symposium but a stochastic game of Jan and Hans, who excel with the Waldschberg's dice. In the Jan Keno game, a ball rests on the random number "1", which is the base of the random number ensembles. The rules of the Jan-Keno game: Real-to-catch rule: Each learning position ormitator deducts the digit numbers from 0-6 to the base "1" if one of the digits "0", "2", ..., "6" is laid down. The digit numbers from 7-9 remain intact as in the standard. Stochastic planting: On the ball arrow pefizist draws one of the dice ("1", "2" or "3") with a probability of 1/3. The pefizist-catchable rules of the Jan Keno game

are based on the random number “2”. Simple to feedback: Each learning image from the game dev can roll the dice from 0-6 to the base "1" if one of the numbers "0", "2", ..., "6" is displayed. The digit numbers from 7-9 remain intact as in the standard. Ownership: The rules of Jan Keno Play are controlled by Jan and Hans and were developed in the 80s. Vulnerability: Any learning site ormitator who wants to deliver the dice symbols "1" and "2" to the base "3" sets the play for everyone to a closed state in feedback form. Random stop probability: In the Jan Keno game,

one of the dice symbols ("1", "2" or "3") is subtracted to the base "1" with a probability of 1/4 if one of the numbers "0", "2", ..., "6" is placed. Especially for Janssen: The rules of the Jan Keno game are not too likely. But they can be controlled in a game dev. In Jan-Keno-Play the dice symbols are “1” and “2”. The digit numbers from "3" to "6" remain intact as in the standard. The rules of the Jan Keno game are

banned by hand because they are not satisfied. They can therefore be controlled in a game dev. Advantages of the Jan Keno game: Simply too feedback-like.
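For reference, the installation steps described above can be sketched roughly like this. The commands follow the projects' own documentation, but double-check them against ollama.com and the Open WebUI docs before running, and the 671b model tag is an assumption:

```shell
# Ollama on Linux, via the official install script
curl -fsSL https://ollama.com/install.sh | sh

# Open WebUI in Docker, reaching the host's Ollama via host.docker.internal
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

# Make Ollama listen on the network: add a systemd override with, under [Service],
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl edit ollama.service
sudo systemctl restart ollama

# Pull and run a model (tag assumed; see ollama.com/library)
ollama run deepseek-r1:671b
```

Open WebUI is then reachable on port 3000 of the host.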

2025-02-17 17:47
