I built an AI supercomputer with 5 Mac Studios

1, 2, 3, 4, 5 Mac Studios. I'm connecting them together and forming a super powerful AI cluster. Why? I want to run the biggest and baddest AI models I can find. We're throwing everything at it, and my goal is to run the biggest of them all: the Llama 3.1 405B model. This thing is scary. It's normally run by super powerful AI clusters in the cloud, on servers that cost more than our houses, but we're going to try it now with five Mac Studios.

Can we do it? I don't know, but we're going to try. Get your coffee ready. Let's go. And thank you to NordVPN for sponsoring this video and making it possible. Yes, they are paying me to play with AI and show you cool stuff. It's kind of awesome. We'll talk more about them later.

Now, lemme get this out there: I did not just buy five Mac Studios to use them for an AI cluster. I mean, it's not beyond me, I would do that, but here at NetworkChuck Studios we're switching from PC to Mac for our video editing pipeline. Comment below if you think that's a good idea. I'm sure we all agree,

but when these beautiful, powerful machines arrived, I'm like, you know what? I can't give them to these guys yet. I want to play with them first. And I just found this software called Exo, from Exo Labs. It's new, it's beta, but it's all about AI clustering. Check this out. You can take any type of computer hardware, I'm talking a Raspberry Pi,

a spare laptop, a super powerful gaming PC with a 4090, and you can connect them together and just have them run AI models. They share the resources. It's actually kind of easy to do, and I'm going to show you how in this video. But first I got to open up all these Mac Studios, and honestly, this was probably my favorite part of the video. I don't know what it is. There's something about opening new tech, unboxing new hardware, that just makes you feel joy, and it's anything: a network switch, a router. It just makes me happy. Are you the same way? Anyways, I unboxed them. They're beautiful. I did smell them one time.

They smelled amazing. But before we get crazy, I first want to talk about AI clusters. Why do this? Now, I already have a dedicated AI server. His name is Terry. I built him in this video here and he's awesome. He enables me to run local AI models here in my studio, meaning I don't talk to the cloud and rely on scary giant companies like OpenAI to run things like ChatGPT. Everything's local. They don't get my data.

But the reason I had to build Terry, who's rocking two 4090 GPUs, is that running AI models can be resource intensive. Right now your computer, the one you're watching me on, can probably run an AI model in moments. You could download Ollama, run Llama 3.2 1B, and it works really well. You can talk to it like ChatGPT, but it's not going to feel like ChatGPT. It's not as smart, and you'll notice that really quickly. The difference is kind of crazy. To get the quality of ChatGPT, you'll have to use a bigger, more sophisticated local model, and this is where your laptop isn't going to cut it. And when I say larger, I'm mainly talking about a thing called parameters.

So I mentioned Llama 3.2 1B. Let's break that down. This is a relatively small model, and that 1B stands for 1 billion, 1 billion parameters. When you think about a parameter in the context of AI, each one represents learned knowledge.

Each of these parameters is a numerical value, or weight, in a neural network, and they help the model make predictions, because that's what a model is doing: when you're talking to it, it predicts what the response should be based on what you're saying. You can think of a parameter as learned knowledge, and the more parameters the model has, the more patterns, relationships, and nuances it can learn from data. Essentially, the more parameters it has, the smarter it is. Now, a 1 billion parameter model like Llama 3.2 1B is good for simple tasks. You can talk to it.

It does basic sentence completion, it can summarize stuff, and you can run it on things like a CPU; a GPU is going to be better. But it has weaker reasoning and factual accuracy. I'm kind of using Llama 3.2 as our baseline. We could go lower. There are lower parameter models that get dumber, but they have their use cases, and if you want to run one on a Raspberry Pi, you'll have better performance. I think there's one called TinyLlama. Lemme go find it, actually.

And that is TinyLlama. It's pretty cute. Now check this out: TinyLlama is actually a 1.1 billion parameter model, but because of quantization (we'll talk more about that later), you can run it with fewer resources: 638 megabytes of VRAM. VRAM, what is that? That's video RAM.

So this is not your typical memory or RAM on your computer. This is memory that your GPU has. And yes, when we're talking about running local AI, the GPU is the name of the game.

If you have one, your life will be better. It doesn't mean you can't run LLMs like TinyLlama on a CPU. You can, but the inference, having the conversation, will be slower. Now, of course, we can go up. So with Llama 3.2 as our baseline, I'll give you some recommended VRAM, like what kind of GPU you might need for each model. Llama 3.2 1B, 1 billion

parameters: it's recommended you have 4 gigabytes of VRAM. Again, you can use a CPU, it'll just be slow. Llama 3.2 3B, 3 billion parameters: you'll need 6 gigabytes of VRAM, so think a 2060 GPU. Llama 3.1 8B, 8 billion parameters: 10 gigabytes of VRAM, that's going to be a 3080. Phi-4 from Microsoft, 14 billion parameters: you'll need 16 gigabytes of VRAM, that's going to be a 3090. And then here's my favorite local AI model right now,

the Llama 3.3 70B, 70 billion parameters. For this, there is not a consumer GPU that can do it right now. You'll need 48 gigabytes of VRAM. For me to run that, I have to use two 4090s. And then let's get crazy. If we go one more up, we've got the Llama 3.1

405B, 405 billion parameters. Now real quick, one thing you might be wondering: you saw me jump from Llama 3.2 to Llama 3.1, then to 3.3, and then to 3.1 again. What's happening? Those are the different generations of models, trained on newer data and having a few new features. But just because Llama 3.2 1B is newer, it doesn't mean it's more intelligent or has better reasoning than Llama 3.1 8B. Anyways, getting back to the 405B: to run this sucker, they recommend one terabyte of VRAM. That's unreal.

That's going to be an AI cluster, and it's not regular GPUs you're going to be using. You'll be using NVIDIA H100s or A100s, and this is what I'm aiming for with my cluster. Think about the best consumer GPU on the market right now, a 4090: it has 24 gigabytes of VRAM. I would need 42 4090s to run that. Now, just so you know, for those of you who might know a thing or two about running LLMs, these numbers probably look a little off, and that's because a lot of them already have quantization built into our metrics. What is that? Quantization could be its own video. We're not going to do that right now.

Just know it's what makes big models fit on smaller GPUs. Now, it doesn't come without a cost. They do have to reduce some precision to get a model to fit on a smaller GPU, but they do it in a way that tries to maintain accuracy. You'll know a model is quantized (is that how you say it? Quantized? Yeah) when you see certain notations. So for example, FP32 is full precision, no alterations; FP16 is half precision.

Now, when I say half precision, it doesn't mean half is bad. We're talking a zero to 2% loss in precision. But then we get into integer-based quantization, and this is where it gets fun for us, because we can run stuff on our GPUs, our consumer GPUs. The first big one is INT8. This will make the model four times smaller with about a 1 to 3% loss in precision. Now, I say that with a giant asterisk. It depends. It depends on how they quantize that model.

There are different ways you can do that, and those different methods change how they try to reduce the loss. Again, that's a whole other video. But just know, as we go down to INT4, which is as low as you really want to go, this is eight times smaller than the full FP32 model, but the loss is pretty big, 10 to 30%, and you'll probably notice the degradation on complex tasks like coding, logical reasoning, or creative text. Go any lower and it loses its mind.

We're talking Arkham Asylum. So many of these models over here are actually using INT8 to make themselves smaller so they can fit on consumer-level GPUs. And INT4 is what I'm going to try and use with Llama 3.1 405B. Now, I'm not going to quantize it myself; someone's already done that for me. I'm just going to try and run it. But even with that quantization, it's a tall order.
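If you want to sanity check those sizes yourself, here's a quick back-of-the-napkin sketch in the shell. It only counts the weights (the KV cache and runtime overhead add more on top, which is why the recommended VRAM figures above are higher), and the bytes-per-parameter values are the standard ones for each precision:

    # weights ≈ parameters (in billions) × bytes per parameter = gigabytes
    # FP16 = 2 bytes, INT8 = 1 byte, INT4 ≈ 0.5 bytes per parameter
    vram_estimate() { echo "$1 * $2" | bc; }   # usage: vram_estimate <billions_of_params> <bytes_per_param>
    vram_estimate 8 2       # Llama 3.1 8B at FP16  -> ~16 GB
    vram_estimate 70 0.5    # Llama 3.3 70B at INT4 -> ~35 GB
    vram_estimate 405 0.5   # Llama 3.1 405B at INT4 -> ~202.5 GB of weights alone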

So how do I expect five Mac Studios to run this model when it would take 42 4090s to do it? Well, the new M-series Macs have a trick up their sleeve: a thing called unified memory, or unified memory architecture. In most systems you have your system memory and you have your VRAM, your GPU memory. The new Macs don't do that. They have one pool of memory for everything, and that unlocks something pretty cool. For example, in my Mac Studios, each one has 64 gigabytes of RAM, and that's shared RAM that can be used by the GPU. So in my mind I'm thinking 64 times five. What does that give me? 320 gigabytes of RAM that can be used by the GPU.
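If you want to check that pooled total for yourself, here's a minimal sketch; it assumes you have SSH enabled on each Studio, and the hostnames are made up:

    # add up the unified memory across all five Macs (hw.memsize reports bytes)
    total=0
    for host in studio1 studio2 studio3 studio4 studio5; do
      bytes=$(ssh "$host" sysctl -n hw.memsize)
      total=$((total + bytes))
    done
    echo "pooled unified memory: $((total / 1024 / 1024 / 1024)) GB"   # should land right around 320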

And it's not just the amount of RAM, it's the transfer. In a typical system, the system memory has to transfer data between itself and the GPU memory. With unified memory, there's no transfer; everything is just using that one pool. One of these Mac Studios is $2,600, and that's for the entire computer. One 4090, just one piece of your gaming PC, will cost you 1,600 bucks, and I get way more RAM to use for my GPU with the Mac. Not to mention it's extremely power efficient. It's ridiculous.

You're about to see the power consumption on a 4090 versus a Mac Studio, but it's not apples to apples, it's Apple to PC. What I mean is that if you put a 4090 gaming PC head to head with a Mac Studio, the PC is going to win every time. Nvidia GPUs like the 4090 have dedicated tensor cores and they're optimized for CUDA. What does all that mean? Well, those are the things AI models have been optimized for, for a long time. Macs have not been thought of as AI machines up until now; it's just been Nvidia. So whenever someone makes a new model, they're making that model to run on Nvidia GPUs, and you're going to have a better time there. Now,

Apple does have something called MLX, its machine learning framework for Apple silicon, and I'll actually be using that with Exo, but CUDA still wins out because of support. Okay, here we go. We're about to test. We have our five Mac Studios, and by the way, here are the specs: they're M2 Ultras, 64 gigabytes each of RAM, unified RAM.

Now, the first big thing we have to figure out is how do we connect these Macs together? They're going to be clustered, which means they're going to be talking a lot, and that's a lot of bandwidth. For our scenario, I went with the built-in 10 gigabit Ethernet connection. So over here I have a UniFi XG 6 PoE 10-gig switch connecting these five Macs together. This, however, will be our biggest bottleneck. Not ideal. Now, 10 gig sounds like a lot, but AI networks normally have extremely high speed connections, I'm talking 400 gigabits per second. In fact, last year I did a video on AI networking with Juniper, and they were about to come out with 800 gigabit per second connections, which I'm pretty sure are out now. So my 10 versus their 800. And it's not just that. With AI networking, and we're talking enterprise AI networking,

they eliminate a ton of the networking overhead that you might see with Ethernet and TCP/IP. In many situations they're doing GPU-to-GPU access, skipping a lot of the OS overhead. But for us, we've got our Mac Studios, and they have to go through the entire TCP/IP stack. Now, the reason this matters so much is that when I install the Exo software on our Macs, it will actually take whatever model we're going to use,

let's say, for example, Llama 3.1 8B, and it won't download the entire model on each individual Mac. It'll actually split up the download, and when we're running our AI model, each Mac will be running part of the job.

But like any good team, that depends on efficient communication. They're going to be talking back and forth a lot, extremely large amounts of data. In fact, I'm going to try and see that: as we're testing, we're going to be tracking the amount of power we're using and the bandwidth. Now, there is a way with my Mac Studios to get more bandwidth, and that's with Thunderbolt. Alex Ziskind, I think that's how you say his name, another YouTuber I just started watching, did this with a bunch of M4 Mac minis. Thunderbolt is powerful because you get direct PCIe access and bandwidth up to 40 gigabits per second, ideally.

The only problem is, when you want to cluster five of them together, you only have so many connections and you can't daisy chain all of them. Now, the way you can solve this is by using a Thunderbolt hub or bridge, and that's what he did, but you'll still have some bottlenecks. By the way, you should watch this video to see how Thunderbolt performs versus Ethernet, which we're about to do right now. Hey, NetworkChuck from the future here. I actually ended up testing Thunderbolt because I just had to; the 10 gig was such a bottleneck, you'll see. And yeah, that's all I got. Back to me. But now we're finally at the point of installing Exo. I'll have a link to the project below, and I will demo how to install Exo on a Mac. For Linux, they do have documentation, but the install is pretty much the same.

Really, I think the Mac is the harder version. A couple of things: you want to make sure you have Python 3.12 installed. I'll go and do that right now. I like to use pyenv to manage my Python versions, and with pyenv I can install Python 3.12. I'll do this on all the Macs.

And by the way, let's get Home Assistant up. I'm actually using a smart plug to measure the amount of power I'm drawing with all five Macs. So right now, at kind of a baseline, we're pulling 46 watts, and that's for all five Macs. Isn't that crazy? Alright, Python 3.12 installed. I'll set it as my global with pyenv global 3.12, and I'll just verify real quick with python --version. Oh, I probably need to refresh my terminal. I'll do a source ~/.zshrc and

try it once more. Perfect. The first thing I'll do is install MLX, Apple's machine learning framework for Apple silicon Macs. I'll do that with pip install mlx, keeping in mind this is very specific to Mac deployments. Notice it is very quick. And by the way, if you find that you don't have pip installed, you can get pip and all the things you need with the xcode-select --install command. Okay, MLX installed on all my Macs.

Now time to install Exo. This will be the easiest part. I'm just going to grab a git clone command to clone their repo, and I'll do that on every one of my Macs here. Jump into that exo directory, and then we'll use the command pip install -e . and take a little coffee break.
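For anyone following along, here's roughly the whole per-Mac setup as commands. It's a sketch of what I just walked through; your pyenv might want a full version number like 3.12.4, and the repo URL is what Exo Labs publishes at the time I'm recording:

    xcode-select --install                        # dev tools: git, pip, compilers
    pyenv install 3.12 && pyenv global 3.12       # Exo wants Python 3.12
    pip install mlx                               # Apple's ML framework (Apple silicon only)
    git clone https://github.com/exo-explore/exo.git
    cd exo
    pip install -e .                              # install Exo from the repo (coffee break)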

Now, while that's installing, a couple of cool things you'll want to know about Exo. First, you're about to see this when we run Exo: the Macs will just discover each other through magic, I mean, through networking stuff, right? But they will automatically discover each other and recognize that they're in a cluster. Exo will also launch a GUI for us, a web interface, so that we can look at it, play with it, and test some LLMs. Speaking of LLMs, they also have a ChatGPT-compatible, or OpenAI-compatible, API, which means that if you actually want to use Exo, even though it's still in beta, still fairly new, you can integrate it into anything that also uses the OpenAI API, which is many of the tools I use. In fact, I just reached out to Daniel Miessler, the guy who runs the Fabric project. I use Fabric every day, and I'm like, hey, can you add Exo to this? He's like, yes, I'm on it. Actually,

by the time you watch this video, it's probably already there. Alright, it looks like our installation is complete. Now, this next bit is very Mac specific. If I ls the directory I'm currently in, the exo directory,

you'll see I have a script called configure_mlx.sh. Running that will tune up your Macs a bit to run Exo, so I'll run it on each Mac. I might want to put sudo in front of it so I don't have to type my password partway through. Either way, notice it did some things; honestly, I have no idea what it's doing. And now we're at the point where we can just run exo. So I'll run it on the first one here: exo, exo, exo, exo.

And then there's one behind it. There we have five. I can't get to the other terminal. Where are you at, buddy? Oh, there he is. Exo. Are you seeing this? Immediately, Exo discovered that there are five nodes in its cluster. Just auto-discovery. Actually, I'm going to stop them real quick, because I want you to see how it rates each machine. So I'll just run one instance of exo right here, a cluster of one. Alright,

so notice here we've got 26.98 teraflops. We're closer to the GPU-poor side than the GPU-rich. I'll show my 4090 performance right here; I'm normally right around here. But when I start running the rest of the cluster, it'll discover the other nodes. And now, when I run one more, we'll see it discover two nodes, it even shows the connection down here, and it increases, or doubles, my teraflops. Now, I'm going to run only one right now and let's

test an LLM just to get our base performance for one Mac. Alright, we're down to a cluster of one. Now, when you want to access your GUI, it'll be port 52415 on the node's IP address.
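By the way, that same port also serves the OpenAI-compatible API I mentioned earlier, so you could skip the browser entirely and hit it from the terminal. A rough sketch; the model id here is an assumption, so check the names Exo lists in its UI or repo:

    curl http://localhost:52415/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "llama-3.2-1b",
            "messages": [{"role": "user", "content": "Tell me a scary story about computers."}]
          }'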

So I'll launch my browser here, and there we go. Notice on the left here we can select our model. It won't download it just by selecting it; like, if I click on 7B, it won't download 7B. But I'll click on 1B, and when I start typing, it'll try and download it. So I'll just say, hey, how are you doing? Downloading it now, and cool, it's working. And then notice right here we have our performance.

It's documenting it for us, and what you want to focus on is the tokens per second. Let's say, tell me a scary story about computers. So, averaging about 117 tokens per second just by itself. Given that this is a small model, one of these Mac Studios can run it no problem, no sweat. What I want to test now is the network bottleneck. If I introduce the other four Macs into a cluster and we divide up the jobs, what will happen? What will it look like? Let's do that now.

So I'm actually going to delete the model here and go add my other Macs. Watch 'em come up here. We got two, three. Clustering together is so easy. Four and five nodes. Let's test it out.

Now I'll go ahead and start talking to it to download the model. It's downloading, and it looks like the entire model on each one. So maybe I was wrong about that, or maybe there's a bug, I don't know.

But either way, we can see our cluster is working, because it's obviously downloading on every one. Let's try that same prompt as before: tell me a scary story about computers. So, wow, the bandwidth limitations are massive: 29 tokens per second versus the 117 we were doing before. So speed is not going to be our friend here. I expected that.

What I'm more excited about is the amount of RAM we have and being able to run bigger models. Hey, guess what time it is? Coffee break time. During this break, I want to tell you about my sponsor, NordVPN. Now hold on, before you click that little fast-forward button, just know they make videos like this possible, so please show them some love, because I want to tell you three ways I use a VPN right now. Number one, I use it to give me a bit of anonymity. When you're accessing a website, many websites will use your public IP address to identify who you are, and they'll use different tracking techniques.

Essentially, when you're stepping around on the internet, you're leaving a footprint, and they're tracking that. So I'll use NordVPN to hide me and also kind of trick websites into thinking I'm someone else. It's a great tool for IT people to quickly change who you are, your identity online. Number two, I watch a ton of movies.

Did you know that Japanese Netflix looks different from American Netflix? Same goes for the UK and other regions. But if you're using NordVPN, you can quickly put yourself in the UK or in Japan, and suddenly Netflix thinks you're in Japan and shows you Netflix Japan. Now, you may have heard this before, but then my producer Alex did something insane. He uses this app called Letterboxd. In fact, Alex, throw it up right here.

One thing it will show you, when you're looking up a movie, is where you can watch that movie across all the streaming services. But he added the streaming services for every single country. So now when he wants to watch a movie, he can actually watch it. Before, he would just have to rent it when it wasn't available in the US. Now he just turns on the VPN, changes his location, boom.

And yeah, you can run NordVPN on things like an Apple TV. And number three, I'll use a VPN to protect myself and my family when we're away from the house and using our devices. And yes, that does include connecting to weird Wi-Fi networks. Now, many VPN haters will say you don't need a VPN anymore because most websites are HTTPS, meaning the connection between yourself and the website's server is secure.

And that's true for a lot of the internet. That's awesome. But what if you get to a website that does have SSL, but it's not a good website? It doesn't matter if it's encrypted. One way people do this is a thing called typosquatting, a typo of a website. So it might be a misspelling of netflix.com, which in an ideal world would go nowhere. But people buy these websites, bad people, and put up a Netflix-feeling place. But with NordVPN,

they do have things like Threat Protection Pro. They'll tell you when websites you're visiting are bad. Also, they'll protect you from ads. Ads suck, except for this one. This one's awesome and you know it. I put a lot of effort into blocking ads on my home networks and my business networks, but when we're out and about,

ads are looking at us, getting hungry. Turn on NordVPN and it will block ads. So yes, using a VPN is very much a valid thing to do in 2025. I highly recommend it. I use it all the time. So check out the link below, nordvpn.com/networkchuck, or scan this new fancy QR code. Is the QR code safe? I don't know. Scan it, get NordVPN, and then scan it again to see if you are safe.

And of course, if you use my link, you get a special deal. What are you waiting for? Check this out, actually: it's a New Year's deal, big savings plus four extra months. I think it comes out to about three bucks a month. Anyways,

thanks to NordVPN for sponsoring this video and making stuff like this possible. Now, back to clustering AI stuff. Let's see what Thunderbolt does. I'm going to go connect my Thunderbolt stuff right now. Alright, Thunderbolt connected. Now, how did I connect these hosts? I'll show you a video of it right now, here's some B-roll, but essentially we're doing kind of a hub-and-spoke situation. We've got one Mac

connected to all the other Macs, obviously a less-than-ideal situation because this guy does become a bottleneck, but we're talking about 40 gigabits per second of networking between these guys over a nice little Thunderbolt bridge. Thunderbolt networking isn't quite as advanced as regular TCP/IP-based networking over Ethernet, so this is the best I could do without pulling my hair out with advanced configs, because I don't want to do that. So I assigned static IPs to all of them. And Exo, by default, should choose the fastest connection. Let's see if it does. And sure enough, we can see traffic on bridge0, the Thunderbolt bridge, in our monitoring.
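If you want to double-check which link you're actually on, and what it can really push, here's a minimal sketch. It assumes iperf3 is installed on two of the Macs (Homebrew has it) and uses a made-up static IP:

    ifconfig bridge0          # the Thunderbolt bridge interface on macOS; look for your static IP here
    # raw throughput test between two nodes:
    iperf3 -s                 # on the first Mac
    iperf3 -c 192.168.2.1     # on the second Mac, pointed at the first one's Thunderbolt-bridge IP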

And before we run all of them, I do want to test one host. Well, not one host, we already know how one host performs. Let's test two first. Alright, we've got a Thunderbolt connection between two hosts; it should be very, very quick. Let's feed it a prompt and have some fun. Okay, it is slightly faster. I think before, with 10 gigabit networking, we were talking about 50 tokens per second, so it's significant. Let's add three.

Join the party, friend. Yeah, let's do the same prompt. It's still better than it was before, but notice that even with Thunderbolt, we're hitting a ceiling. Now, why is Thunderbolt better? More bandwidth, that's obvious, but it also has more direct access to PCIe. Less overhead, more direct. Let's add the team. Come on in, guys. Alright, we've got a cluster of five. Let's see how we do. Okay, that's actually not bad. It's like, wait,

you just told me, and then you're telling me... you can't make up your mind. Let's try this one. Okay, so all hosts are being used. Let's try this prompt and watch the networking happen. So bandwidth usage is obviously pretty much the same; we expected that. Right, now let's test a bigger model: the Llama 3.3 70B, my favorite model. Actually, DeepSeek R1 just came out.

I haven't played with that yet. But I'm going to go disconnect the Thunderbolt connections and we'll run 10 gigabit on the first test. Okay, running exo on just one host, running the Llama 3.3 70B. I expect it to be pretty good. This is a quantized model, 4-bit; it should be able to run everything. Let's see how this performs.

Here's the host right here. Watch the RAM usage just go crazy. The seconds to first token are taking a bit, and then the GPU takes off. So, 15 seconds to the first token, it was just loading, and the performance is less than great. Now, I will say this: we're going to test Ollama after this. For whatever reason, the models in Ollama are better. I'm not sure what they're doing.

And when I say better, they seem to perform better; I'm not actually sure if they are better. Okay, let's stop this nonsense. Let's try the 70B on two hosts. We'll ask the same question, take a look at our metrics here, and let's go.

Got to download part of it to the other host. Alright, memory's coming in hot. You know, it's not doing too bad, better than I expected. Let's check the networking. Alright, networking test. Now, is it just me, or is it using less bandwidth than before? That's funny. Alright, let's add 'em all in. All five are having a party. Got our monitoring up, asking a question, now let's see how it performs. Oh, got to download one more bit. Two minutes.

Killing me. Now, this honestly is the most painful part of making this video, and probably for you if you're ever going to use this: waiting for these models to download. This video took me way too long. I anticipated one day for this video. Oh no, no, no, no, no. Foolish Chuck. I wish, and I think it's coming: I saw a few pull requests on GitHub for hosting the models locally and pulling them down from there.

I do love the fact that they break up the model across your network, but it would be so much faster if the model was local, if we didn't have to pull it from Hugging Face each time. Alright, now we can finally try it. Here we stinking go. Hey, that's actually not bad. We're using all hosts, 15 tokens a second, memory usage is good, GPU spread across all our hosts. I'm happy with this. Not the fastest thing in the world,

but it's stinking working. I love this. Let's test networking. Alright, we've got our network monitoring set up. We'll launch exo once more, five nodes up, and let's test it out and watch the networking go crazy, I'm assuming. Yeah, here we go. Okay, so we're distributing the network traffic across all the hosts. What's funny, though, and I dunno if it's just iftop acting kind of crazy, is it's showing my computer as the highest bandwidth receiver. It's kind of weird. But looking at the cumulative, I mean, 64 megabits per second... wait, that's bytes.
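For reference, this is roughly the iftop invocation I'm using to watch the cluster traffic; the interface name is an assumption (use whatever your 10-gig or Thunderbolt-bridge interface is), and the -B flag is the cure for my bits-versus-bytes confusion:

    # -i picks the interface, -n skips DNS lookups, -B shows bytes/sec instead of bits/sec
    sudo iftop -i en0 -n -B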

And we're getting about 10 tokens a second. Let's test Thunderbolt. We'll test two hosts, just as we did before. Let's see how we do. So watch two hosts just go crazy. And performance is... I mean, we are using Thunderbolt, right? Yeah.

Using Thunderbolt, performance is meh. We're not using any swap, are we? No swap. Meaning, if we ran out of RAM, it would switch over to swap, which means it would start to borrow space from the hard drives, the SSDs, which are less performant, not as fast. RAM is extremely fast, which is why it costs so much.
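If you want to check that from the terminal instead of eyeballing a dashboard, macOS will tell you directly; a quick sketch:

    sysctl vm.swapusage   # total / used / free swap; "used = 0.00M" means no swapping yet
    memory_pressure       # overall memory pressure stats from the OS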

Alright, let's test the team: five hosts, Thunderbolt bridge, 70B. Let's see how we do. Hey, that's not too bad. Actually, I'm happy with 11 tokens per second, with the steps being spread across the hosts. Man, if we could figure out this bandwidth issue, that would be killer. I don't know how Exo Labs is going to solve that, though, because we're at the mercy of whatever hardware we have. Maybe they'll figure out something clever.

Let's check the networking, and then we'll jump into our final test: can we run the 405B? Now, at the beginning of this video I did say the 405B is the biggest, baddest of 'em all. DeepSeek R1 just came out, a local model that's supposed to outperform o1 in reasoning, and their biggest one is the 671B, which is just a behemoth. And no, I cannot run that.

That's way too big. I like doing the Thunderbolt run because, on the networking side, it's very obvious the hosts aren't talking to me; they're on their own little private network, a full Thunderbolt bridge connection. Let's ask it a fun question. Let's see if it'll do this. Ready, set, go. Okay,

so still 10 tokens per second. Watching the networking, it's funny, we're not seeing a lot, are we? That's weird. Is it not using the Thunderbolt bridge? They're definitely connected that way. Am I losing my mind here? I'm monitoring these interfaces, right? Why is the scale all the way up here at 19 megabits when I'm not seeing the graph move? I'm probably just doing something wrong. It's kind of strange. Okay, enough of that. So the 70B, we know we can run this, but now I want to run the biggest, baddest model of them all.

We'll see if we can run it on 10 gig first. Now, I will say this: to run this model, it took me a bit, because downloading it... it is so stinking large. And yes, it's amazing that I can just click on it and say something.

When I went to the 405B up here, let's see, they have it sitting right here, and it would start downloading to all my hosts, but it took forever. And when it finished, it was still kind of buggy. I didn't trust it. So I wanted to download it locally and run it locally, but that involved finding a pull request that allowed that, a feature that wasn't merged yet. So I did that, I checked it out, and I'm currently on that branch. And looking down here,

you'll see I have a local 405B 4-bit model. Now, I'm not even going to try running this on one host. It'll kill itself. Actually, you know what? We should just do that for fun, just to see it happen. So, running one host, and by the way, I did have to download the entire model, which I think is roughly 200, almost 200 gigs, and then put it on each host, which was way more efficient than downloading it from Hugging Face. Alright, I told it to tell me a story. Watch this RAM load up. It's like, ah, ah,

it's about to scream. And then you'll probably see the swap, I mean, you'll definitely see it, I believe. Watch the swap right here. We're at 50. Here comes the swap. There it goes. We're going to use all the hard drive space, and this... it will end up working eventually, I think, but we're at 20 gigs of swap. I don't want to use it. I think I'm almost out of hard drive space. I want to stop that right now.

Get out of here. It'll probably end up timing out. Let's see if it goes back down. Okay, cool. That was scary. That's the reason we don't do that. But if we share RAM between all of our hosts, it should be a bit better.

Let's make that happen right now. Let's get our cluster running. We're still on 10 gig Ethernet. Cool, all the hosts are up, and it does have the model on each one. You know what,

actually, I need to run kind of a special version of the command. I need to specify MLX as the inference engine, just like this. It should auto-discover that and run it, but I want to make sure I don't screw this up.
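The command itself is simple; this is roughly what I'm running on each node (flag name per the Exo docs at the time of recording, so treat it as an assumption if you're on a newer build):

    exo --inference-engine mlx    # force the MLX engine on Apple silicon; node discovery still happens automatically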

Okay, we're active. Let's stinking monitor and see what happens. Again, this was our goal: to run the biggest and baddest model, if you ignore the recent news. But here we go, and run. Alright, let's watch the RAM fill up across the board. So it's filling up the bottom guy here, bottom left. Hopefully it doesn't get to swap; it'll just disperse it across all the nodes. Is it filling up here? Yep, it's doing the top-right guy for now. So swap is still inactive.

Okay, it's just slowly filling up the cup of each node. Alright, now we're filling up this guy. Still haven't gotten to the point where we start generating text. Alright, filling up the top left. It's just taking a minute to load the model into memory, and I think we're almost there, just to fully distribute the sucker.

It's taking forever. Well, at least it's evenly distributed. And swap? I'm curious about that network error, but it did start, it gave us a word, and so far I don't see any swap memory used. Let's refresh our page and try it again.

It should keep the model loaded so we don't have to wait again. In one paragraph, generate something for me. Okay, here we go. Let's go. It's doing it.

It only took five seconds to get to that first token, and we're rocking a blazing speed of 0.8 tokens a second. But you know what? We did it. We're running the biggest, baddest model of them all on local hardware. What would normally take an entire data center of stuff, we're doing right here. Slow, but we did it. Take that, Zuckerberg. Take that, Musk. We don't need you, although we are using your model.

So we're rocking 0.5 tokens a second. Will it be faster on Thunderbolt? Let's see. Alright, I'm going to stop this nonsense. I've got Thunderbolt up and running, or connected. Let's run exo now. Oh wait, I've got to do my MLX version. Alright,

five nodes, we're on Thunderbolt. Here we go. I'm excited to see what this does. Right now it's not loaded into RAM, so it might take a bit. Here we stinking go. RAM's going nuts. Here we go. I wish it would fill them all up at once. Man,

it takes forever to load this model. Goodness. Coffee break. Haven't taken one of those in a while. It's going to time out before it loads it all up. Yep, and it timed out.

Let's try it again. Okay, GPU is spiking. We're getting some stuff, but we're not using all of the GPU. So again, our bottleneck here is the networking. I would love to know what the experience would be with some serious connectivity between the GPUs of these five Mac Studios.

Now, performance is not any better. We're talking 0.6 tokens per second, and we kind of froze at that. But as far as RAM goes, it's holding up, very exciting. Very slow, but exciting. Let's check the network activity. Alright, time's getting away from me. So networking, I mean, it's not using a lot of bandwidth, it's just not. Okay, old man, let's put you to sleep. You're too slow. Okay,

so what I want to show you real quick, though, is the performance of Ollama. Ollama, if you don't know what that is, is probably one of the best ways to run AI models locally on your machine, any machine. It's so easy to install. So, ollama list, I'll see what I have installed right now. Ollama's not running. Let me jump into the GUI real quick with RustDesk. I forgot I wasn't using Linux, which is why I love Macs.

Sometimes you forget you're not using Linux. We'll run the Llama 3.3 70B. It'll take a minute to download, it's 42 gigs, but you'll see how much better this is.
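The Ollama side really is just a couple of commands; the model tags below are the ones from the Ollama library as of this recording:

    ollama list                  # see what's already pulled
    ollama run llama3.3:70b      # pulls ~42 GB on first run, then drops you into a chat
    ollama run deepseek-r1:70b   # the DeepSeek R1 70B distill I try in a minute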

Now, while it's downloading: I was talking with Daniel Miessler, the creator of the Fabric project, and he, he's such a great guy, I love him. I texted him and said, hey, please add support for Exo; I want to test it with Fabric. He did it. So let's update Fabric real quick. Alright, so here it is, and it's going to go pretty quick. Oh, look at that, it's fast. So the model is definitely loaded up. No swap, though. GPU usage:

it's using the full GPU. So here's the thing with Exo: I think maybe MLX performance on the Mac isn't quite there yet, but we might see something different on non-Mac computers, things actually running Nvidia GPUs. I'm replacing my video editors' PCs, the ones that have Nvidia GPUs, with these Macs. Should I do another video where I cluster all those extra computers together? Let me know. But it's doing great. Now, as you can tell, it would not go well if I tried to run a larger model like the 405B, but this model, with my 64 gigs of RAM on my Mac Studio, runs like a dream, which is pretty amazing. The 70B model is awesome.

The new model from DeepSeek, we've got to try that. They have a 70B as well. Let's see how much space we have on our machine here. Yeah, we've got space. So let's try and run this, 42 more gigs. While that's going, I'm going to run Fabric. Okay, let's test this out real quick. Hey,

how are you? And we're not using swap, and we're running DeepSeek. That's huge on a Mac Studio. That's amazing. Okay, I've got Fabric updated to this branch. I'm so excited to try this.

And by the way, I'm able to use Fabric, or rather, Daniel Miessler is able to hook Fabric up to Exo, because Exo uses ChatGPT-compatible APIs. That's not the command. Let's run this setup. We'll do exo, make sure I'm on the latest release, and let me run all my stuff here. Every one of these hosts should run its own little API, and I'll run off the main one here. So: echo "tell me a story", and pipe that into fabric.
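Roughly what that looks like; Fabric's exact flags vary by version, and pointing it at Exo's endpoint is whatever Daniel wired up in that branch, so treat this as a sketch:

    echo "tell me a story" | fabric --stream              # stream the response back as it generates
    echo "tell me a story" | fabric --pattern summarize   # or run it through one of Fabric's patterns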

Okay, not working. Oh, there we go. No. Oh, it's working here. I want it to stream to me, though. Will it stream? It may not support streaming. Oh, there it goes. Oh, that's sick. I'm using Fabric with this. This is so cool. Okay, that's so cool. Daniel, thank you so much. Let's test it out. Okay, things are happening. Ah, yes, Fabric. It's taking a minute, but here we go. Oh my gosh,

that was a quick story. Lemme have it summarize something. Let's go to BleepingComputer. Let's take all this text, copy, paste, summarize. Boom. Okay, so we're sending it a lot of text, and there it goes. Oh my gosh,

this is awesome. Yes, this is happening. Anyways, that's stinking cool. Alright, let's bring this video home. It's been way too long. Exo is very cool. I'm excited about it. For the Mac with MLX, I think there's still more work to be done, and networking is currently still a bottleneck, although I don't know how it's going to perform on an Nvidia-based cluster. Let me know below if you want to see that. Also,

I was kind of thinking about doing a Raspberry Pi AI cluster with Exo. Lemme know if you want me to do that. That's all I got. I'll catch you guys next time.
