George Hotz | Programming | tinygrad: LLVM speed on CPU, Apple M3 edition, Part 2 | tinycorp.myshopify.com


Good morning everybody. We're drinking green tea. I was re-watching yesterday's stream and realized I have some bad habits: I say "you know" and "right" way too much, so it's time to stop doing that. I also realized that chat was full of idiots and I shouldn't listen to them. Okay, so let's get into what we're doing today. We're going to keep it on subscriber-only chat; subscribers are slightly less stupid than the average chatter. I realized I have to assume there's a world of people out there who watch these streams and are different from the type of people who comment in chat, because if those people don't exist, then I don't know what the point is.

Okay, so we can break this down. This is this command run in tinygrad, and we can break down all these things. There's a bunch of — okay, I need to work on explaining things better. I have this internal language in my head, maybe it's my chain of thought, which often makes it difficult to express what I'm thinking, and it comes out as some kind of confused pidgin. I shouldn't be doing that; I should be communicating more clearly. Right now I'm thinking this, but I'm also thinking I just want to paste this into ChatGPT, which I logged into. I'd just like to see what ChatGPT thinks: can you break down the reasons tinygrad is slower than torch on an Apple M3? Let's see how much ChatGPT knows. No, these aren't right. Okay, ChatGPT was useless. "Oh, okay" is another one of my verbal tics. I need to stop with the verbal tics: if you don't have anything meaningful to say, don't say it.

We can break this down into several different reasons why certain operations are slower, and for each one of these there's almost a different reason, but it's not a crazy long tail. There's a small number of things we can fix to achieve performance that's better than PyTorch. Before we focus on the negatives, we can focus on the positives: these ops are faster in tinygrad than in torch, and these are too, but certain other things are slower. It's not as bad as you might think, and we can get into the reasons why.

"So" is another verbal tic. A 1x1 add is a very fast kernel; you can see how fast it is here in torch. I'll put myself up in this corner because we're going to be using this space. A 1x1 kernel is so fast that we're limited entirely by kernel dispatch speed. While tinygrad is very good at chaining multiple kernels together, using all the graph APIs to do that, when you're just doing a single kernel you're limited entirely by how many layers you have to go through to dispatch. I actually improved this a bit; this is the kernel launch benchmark in extra. That's for Metal, but if we do it with LLVM it's taking about 0.011 milliseconds, or 11 microseconds, to launch a kernel, and it's going through a bit of Python to do that. That Python can be made a little faster, and that's what will close the gap for this first one. But that one's not all that interesting.

These cats are twice as slow because I'm pretty sure the code is bad — I'm pretty sure the uops are bad. If you want to take a look at that, you can go in here. Let's take a look at it with clang; it's the same slowness in clang. With DEBUG=4 we can actually look at the code. So this is the code for cat. What is cat doing? Let's read it: it's just concatenating two tensors along an axis. But tinygrad doesn't have a real way to express memory operations like this, so the way it does it is: it makes a big empty tensor, takes the first tensor with zeros padded after it and the second tensor with zeros padded before it, and then adds those two tensors together, which functions as a cat.
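To make that concrete, here's a small numpy sketch of the pad-and-add construction (illustrative only — not tinygrad's actual code):

```python
import numpy as np

def cat_via_pad_and_add(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # tinygrad-style cat: pad each input with zeros to the full output
    # size, then add. The zeros make the add act like a copy.
    a_pad = np.pad(a, (0, len(b)))  # [a0 ... an, 0 ... 0]
    b_pad = np.pad(b, (len(a), 0))  # [0 ... 0, b0 ... bm]
    return a_pad + b_pad            # same result as np.concatenate([a, b])

assert np.array_equal(
    cat_via_pad_and_add(np.arange(3.0), np.arange(4.0)),
    np.concatenate([np.arange(3.0), np.arange(4.0)]),
)
```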
If the rewrite were smart, it could figure out that it doesn't need to do all those extra loads, but I'm not sure how smart the rewrite is. Here we can see the generated code, and no, the rewrite is not smart. Take a look at what's happening: this is the full 512 elements that it's catting, and this is a boolean which determines whether you're in the first half or the second half. If you're in the first half it reads from the first tensor; if you're in the second half it reads from the second tensor; and then it adds those up. Unfortunately, this is a very slow way to do it. What should really happen is that this loop should be broken into two loops — this variable should split the loop — and then you do just the first tensor in one loop and just the second tensor in the other. You avoid all of the extra stuff; you don't actually have to do the add. Or maybe there's a simpler way to do this instead of breaking the loop.

Do you want to say hi? This is Amanda. Hi, nice to meet you. She brings me pills. What are these ones? "It's a B complex." Why? What happened to the purple B complex? "We looked at exactly what was in the Monster Energy drink that you liked, and we tried to replicate it." Where did this one come from? "The drawer. It was in the drawer. I'm not trying to poison him, I promise." What's this one? "That's echinacea. It's a plant, to help you with congestion." Do I like plants? Do you like animals? Do you like minerals? Yeah, you can see yourself on the camera. "That's a magnesium L-threonate." Oh, that one sounds good; let's eat that one. The other ones are good too. What's this one? "That's the NAD+ precursor." We're on the Bryan Johnson pills team; we eat pills. You should see — she eats a lot of pills. "He barely eats any pills. This is not that many pills." Pills are part of a balanced diet: I've got my Frosted Flakes, my orange juice, my banana, and my pills. "All right, I'll leave you to it. Bye, enjoy your pills." We have pills; we have to eat them.

It's probably simpler to fix this differently: this doesn't have to be an add, because you're adding zero there. We can look at what's going on. You kind of have to look at this pattern and know that these loads are gated — you can see these indexes are gated by this CMPLT and CMPNE — and then you don't actually have to do the add. Instead of a gated add, you still want the load itself to be gated. This isn't that easy, but you can look at this code and see why it's considerably slower than torch's, which is basically just going to do two memcpys. This should be two memcpys, but due to the way cat is written in tinygrad, it's constructed like this instead. We used to have a really slow arange, but we fixed arange to be fast by using rewrite rules. I can show you where they are.
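Schematically, here's the difference between what the kernel does and what it should do (a Python sketch, assuming the 512-wide cat of two 256-element inputs shown on screen):

```python
a = [float(i) for i in range(256)]
b = [float(i) for i in range(256)]

# what the generated cat kernel effectively does (slow):
# every iteration gates both loads and does an add that's really a select
out = [0.0] * 512
for i in range(512):
    lhs = a[i] if i < 256 else 0.0          # gated load from first input
    rhs = b[i - 256] if i >= 256 else 0.0   # gated load from second input
    out[i] = lhs + rhs

# what it should be lowered to (fast): two ungated copy loops, no add
out2 = [0.0] * 512
for i in range(256):
    out2[i] = a[i]
for i in range(256):
    out2[256 + i] = b[i]

assert out == out2 == a + b
```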
When I talk about rewrite rules, they're these sorts of things — the arange one is actually here, and there are more rules in the rewriter, or maybe here. There are too many rules, all over the place, if I can't simply keep this in my head. Some of these rewrite rules, you can see, are really simple: they're just things like x + 0 = x, or x * 1 = x, let's say. So yeah, range stuff — here we go, it's in the rewriter. Oh, it's kind of short now; it's just called the arange loop folder. Aranges are written through this construction where it creates a diagonal tensor of ones and then sums it, and that gives you a range. The reason to do it this way: you'd think a range is really simple, I can just take the loop variable and stick the loop variable in, but there are a lot of edge cases, and that's really hard to express in a way that fits into the tinygrad framework. Similar with this concatenate thing: it's really easy to express that tensor one padded with zeros, plus zeros padded with tensor two, equals tensor one and tensor two concatenated. So there need to be rules written like this to simplify the concatenation, and that's why this one's slow. So we've explained this one, and we've explained this one. I'm not even sure what that one is, but it might be the same thing. Let's see what that one is: permute, contiguous — that could be a whole lot of different things. It's interesting how high a RAM bandwidth this one is getting; it's much higher than all the other ones. In theory this chip has 400 gigabytes per second of RAM bandwidth, but I'm not sure what the maximum achievable throughput on a single core is, and all of these tests are single-core tests.

Now we get to convolutions. This convolution is faster in tinygrad, about twice as fast; these two are a little bit slower. This one I'm not sure why; it could potentially be Winograd, so we can try that over here and come back to it later. That explains the convolutions. I have no idea why exp is slower. It might have to do with the fact that we have our own implementation of exp — we call it the transcendental library — while it's possible torch is using the CPU's built-in one. We don't actually emit calls into libraries for exp on CPU, because on CPU you'd be calling into these libraries which are doing a lot of instructions.

If you've ever read Elon's rules for process — half of why ChatGPT is good is that it doesn't have ads yet. The beauty of these LLMs is that it's going to be hard to build a moat like Google's. Google spent a lot of effort optimizing the long tail of search, and that's a reason they became really hard to compete with. They also had a data advantage: people wanted to get their stuff into Google, so everyone would help Google index them. This may become true for LLM companies too, but due to the nature of LLMs being things you can just shove anything into, that persistent indexing advantage Google had — the one that made the web crappy for twenty years — is probably gone. And yeah, wow, this is much more beautiful than every time I've tried to Google this. Thank you, ChatGPT. "Make the requirements less dumb, try to delete part of the process, simplify or optimize" — I really like these rules, but I have one I call rule zero: surface all complexity. If you have something like "oh, we have a supplier and we can get this" or "we have a library and we can import this", you don't really know what that's doing. I want to bring that complexity as close as possible to the surface, and that's kind of the philosophy in tinygrad. I can show you the exp rewriter in transcendental: this is our function for exp.
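Here's a toy numpy version of that arange construction (my sketch of the idea, not the real tinygrad code), plus the kind of rule the loop folder has to apply to collapse it:

```python
import numpy as np

n = 8
# arange expressed as a sum over a triangle of ones: row i of this
# strictly-lower-triangular matrix contains i ones, so summing each
# row yields [0, 1, 2, ..., n-1].
tri = np.tril(np.ones((n, n)), k=-1)
assert np.array_equal(tri.sum(axis=1), np.arange(n, dtype=float))

# the loop folder's job is to rewrite that whole reduction back into
# the loop variable -- conceptually a rule like
#   SUM_{j<n}(1 if j < i else 0)  ->  i
# alongside trivial simplifications such as x + 0 -> x and x * 1 -> x.
```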
For some reason it's twice as slow as the one torch is using, so somebody can look into that, but I think that's unrelated to this one, certainly unrelated to this one, and also unrelated to this one. Those are three separate problems, and they're all approachable.

GEMM is an interesting one. Notice how much worse we're doing on GEMM, and also notice how GEMM in torch is a big outlier on flops. The reason it's a big outlier is that it's using different hardware. The Apple M-series chips have something called the AMX, and it is — again, this is where I struggle to communicate; I have internal ways that I think about this stuff that I don't always verbalize — a multiply-accumulate array. I put this in the docs yesterday; those docs are now live on docs.tinygrad.org, go to the speed section. So, a multiply-accumulate array, a MAC array: the main value of these is that since they're 2D, they create an N² ratio between the compute and the input data. I understand, when I write a sentence like that, that not too many people are going to just read it and understand it. I gave a talk in Poland last year about tinygrad stuff and it was clear that nobody understood anything I was saying. I've had people start to do this to me too, where they write things like this and I don't understand them, and it's not a virtue. It's not a virtue to not be understood; you should be able to make things understood simply. This picture explains it better than the sentence does: the X and the Y are the O(N) inputs, and this accumulation is N² — you can see it's N² because it's a square. It's a multiply-accumulate array because it's multiplying X by Y and adding it to Z. (There's a small sketch of this below.) This isn't what GPUs use; GPUs use tensor cores, because it's hard to fit this 2D output register into a GPU paradigm nicely. It's much easier to fit the tensor core paradigm, where the output has the same size as the inputs. The reason torch is so much faster here is that it's using the AMX.

How did we do when we enabled Winograd? Oh, this one got faster, this one got slower. I didn't realize these were all k=3; I thought one of them was a 5x5 convolution, but I guess not. So this one is faster, this one's the same, and this one's slower when we switch to Winograd; torch probably has good heuristics for this. Winograd is an interesting one in that it's not written as rewrite rules, but maybe we could write it as rewrite rules — that would be the right way to do it, as optional rewrite rules in tinygrad. Right now Winograd is just implemented in the tensor ops: if we go to tensor.py and look for Winograd, you can see this is the normal conv logic and this is the Winograd conv logic. Sometimes Winograd is faster, sometimes it's slower; I'm not that worried about that, it's only off by a little bit. These torch numbers are using the AMX, and that's why there's a huge disparity.

Now, tinygrad actually supports the AMX. Unfortunately it's supported only in clang, not in LLVM — let's let this run for a little bit — and we have a bounty, locked to the guy who originally wrote the AMX code for clang, for adding AMX support to LLVM. But unfortunately our AMX path is still slow, and the reason it's slow is the same reason GPUs use tensor cores and not MAC arrays: this register is special. Let's see if we can trigger the AMX on something small. Yeah, this is triggering the AMX; this is the AMX here.
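Here's that picture in miniature, as a numpy sketch: a MAC array computes a rank-1 update, so O(n) input elements feed O(n²) multiply-accumulates, which is where the compute-to-bandwidth win comes from.

```python
import numpy as np

n = 16
x, y = np.random.rand(n), np.random.rand(n)  # the O(n) inputs
z = np.zeros((n, n))                         # the n*n accumulator "register"

# one MAC-array step: z[i][j] += x[i] * y[j] for all i, j.
# 2n input floats produce n^2 multiply-accumulates.
z += np.outer(x, y)

# a full matmul is just this rank-1 update repeated over the K dimension:
# C = sum_k outer(A[:, k], B[k, :])
```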
Unfortunately, you can see what this code is doing: even though it's using the AMX to do the accumulate, it has to load in and out of this register every time, and the data path into and out of the AMX is just main memory. In order to fit it into the tensor core paradigm, these things are doing loads and stores to main memory every time. This is math; these are memory accesses. Unfortunately this is not documented great, and it should be documented better: four is ldz, five is stz, and that's loading and storing to the Z register. What needs to happen is that tinygrad needs to understand that it doesn't need to load and store the Z register every time. This stuff also can't be optimized by LLVM, because it's inline assembly; there's no way something lower level is going to optimize it. We have to see this at the tinygrad level. And then this accumulator, being created in all of these what-are-effectively registers — which are probably being stack-spilled, because there are many of them — should just be that internal Z register, and tinygrad needs to understand that. I should add a bounty for this as well if I don't already have one. I might have one... yeah, I do, that's this one here, and it's already priced at what I was thinking of pricing it at: it's $500 if you can fix this. It's not that hard to fix; you just need to express these loads and stores, and then get the loads and stores to cancel out when there's a store followed by a load, even if that's through a loop. So unfortunately, even though we have AMX support, it's actually slower than not using it. This is using ARM NEON; NEON is ARM's SIMD extension, similar to AVX on x86. That's why these ones are slow, and that can be dealt with separately.

What's interesting here is that if you get the AMX to work fast in tinygrad, it's likely the AMX can also be used for convolutions, and then we can start getting numbers that look like this but faster than torch — we could be 10 or 20x faster than torch on Apple CPUs. The AMX is available in the M1, M2, and M3. I believe it's also in the M4, but there they have a different way of accessing it, through ARM extensions — something I read about. I don't have an M4, but I think if you make the AMX fast, it's going to be very similar to also make that fast. And then we can get gains not just in GEMMs: because of the way tinygrad is written, we generate the code every time, whereas torch is calling into some library, and the GEMM library supports AMX but the convolution library doesn't — especially when you get to 1x1 convolutions. A 1x1 convolution is a GEMM; there's no difference. And chat says "that's not a 1x1 convolution, that's still 3x3" — fine, but a 1x1 convolution is a GEMM the way I think about it: it's a batched GEMM, where your width and your height become a batch dimension for your GEMM.
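A quick numpy check of that claim (my own sketch): a 1x1 convolution is exactly a GEMM over the channel dimension, with batch, height, and width flattened into one batch axis.

```python
import numpy as np

B, Cin, Cout, H, W = 2, 8, 16, 5, 7
x = np.random.rand(B, Cin, H, W)
w = np.random.rand(Cout, Cin, 1, 1)

# 1x1 conv, written directly: out[b,o,h,w] = sum_i w[o,i] * x[b,i,h,w]
conv = np.einsum('oi,bihw->bohw', w[:, :, 0, 0], x)

# the same thing as a GEMM: flatten (B,H,W) into one batch axis,
# then it's just a (B*H*W, Cin) @ (Cin, Cout) matrix multiply
gemm = x.transpose(0, 2, 3, 1).reshape(-1, Cin) @ w[:, :, 0, 0].T
gemm = gemm.reshape(B, H, W, Cout).transpose(0, 3, 1, 2)

assert np.allclose(conv, gemm)
```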
This is what I want to focus on today: why are these matvecs slower? That doesn't make all that much sense; somehow torch is getting much higher memory bandwidth than I am. Maybe the first ones to focus on aren't even those, though, because I think these are the same problem as these: how is torch capable of doing a sum faster? It's doing a sum and getting 74 gigabytes per second, whereas I'm only getting 47. I think the first thing to do, and what we're going to work on, is benchmarking how fast we can get a single core to read through memory. A sum is just summing all of these numbers together; I can't imagine that it's somehow flop bound and not memory bound. It's "load this big chunk of memory while accumulating the numbers as we go". Let's eat some more pills. Does chat have anything useful to say? Thank you for gifting subs.

I'll also take a minute to talk about this bounty. We have a $1,000 bounty, to be matched by Tenstorrent, so you might get $2,000 — corsix also said he'd personally match it for Wormhole — if you want to get tinygrad working on Tenstorrent hardware. Tenstorrent hardware has some unique characteristics; it doesn't fit perfectly into the GPU programming model because of how different its memory system looks. This is Grayskull, Tenstorrent's original chip. I believe that only certain cores on the chip are capable of accessing memory, and then you use the communication primitives between the cores to send things from memory to the other cores. This is very different from a GPU. On a GPU, no cores have privileged access to memory, and they spend tons of transistors and power making memory access fast. GPUs are load-store machines: they're designed to make random memory access as fast as possible for a huge number of cores, and they have incredibly large, power-intensive caches in order to do that. You've heard about the Infinity Cache on AMD; Nvidia also has a very large L2 cache. These things are power-intensive and big, but they provide fast memory access for all of the cores, whereas Tenstorrent has a different sort of architecture.

Then you can get to things like the Google TPU, which goes even further. The Google TPU is kind of similar to the Tenstorrent chip. I do think they have a shared — not with the new one; that's an old one here. Okay, I think it's changed; I think it's a little more generic than that now. Forget the TPU v1, that thing's confusing; I don't know what they're saying about DDR3. But the TPU v2 seems to be similar to what's used in all the modern stuff: they have a DRAM that's only connected to an SRAM, and this DRAM-to-SRAM path is explicitly managed. These are TPU v2s. There's a lot I still don't know about this; this was one evening of trying to look into how the TPU works. The basic tradeoff here — there are much better diagrams of this; let's see if we can find them for the modern ones. God, how unusable the internet is, with all the crap they add on top. I have to get rid of DuckDuckGo.
I don't know why I'm using DuckDuckGo; it's just the default in Firefox — or no, it's not the default in Firefox, it's the default in Chromium, in un-Googled Chromium. Why do they add these stupid things on top of it? There's this conference called Hot Chips, and it has all the best stuff. I'm on a Japanese VPN because ChatGPT is not available in Hong Kong. This is what the modern TPU looks like: this is the memory here, and you can see that the memory is not even connected to the compute core; they just have this CMEM, which is effectively SRAM, between them. You can also look at the Qualcomm DSP — we're working on a contract to make this fast. Chips and Cheese is a good one, one of my favorites; they really break things down, and they make these beautiful diagrams. This is how the Qualcomm DSP works, and you can see the DSP looks quite similar to the TPU. Well, it does have this path, so maybe the compute can directly access the memory — on the DSP the compute can directly access the memory, but there's also this thing called the TCM, and the TCM is basically the same thing as the CMEM: these on-chip SRAMs that have much higher bandwidth than the DRAM. They also added a tensor core to the new DSPs; I'm not targeting that one. These are all the things you have to think about when you're thinking about how to actually get speed on a platform. You have to think about all of these paths and how to saturate all of them at the same time: you want to overlap your memory accesses with your compute as much as possible, you want to keep every piece of the processor fully saturated — saturate the ALUs here, saturate this interface here, saturate whatever bandwidth you can get out of that there.

So where does that leave us? Why is this slower than torch? I think we'll dig into that, and we'll write a minimal C program to try to spin through memory as fast as possible using ARM NEON extensions, and instead of creating a dumb runner on our own, we can use tinygrad's. So let's focus on these two, these reduces, and see why we're getting worse performance. The max is way slower too; I think I did look into that once and found out why. This is the rand — that stuff doesn't really matter, it's just annoying, because these are tinygrad kernels too, and when I'm using BEAM... you see why that's slow: it first has to create the input. I don't like having the other noise in the tinygrad trace; random works, it's just annoying. So now it's just a copy-in. There — now it's a copy-in, and it takes 11 milliseconds. It's actually much faster to do the randomness in tinygrad than in numpy, but there's a lot less noise when I do it in numpy. Sorry, I've got to make that one smaller. The kernel I want to focus on is a .sum(). Why is that giving us so much higher throughput? Oh, we're on Metal. Okay, there we go, now it's slow. You can see on Metal it's very fast; we're getting 20x the throughput on Metal, and — Apple chips have unified memory, and this isn't flop bound on the CPU — so for some reason the GPU is capable of accessing the memory 20 times faster. I'll also note that there are two kernels here. This is probably something that's mostly optimized for GPUs, because on GPUs you want to make sure you're using all of the global cores: it does a two-stage reduce, first using all the global cores to accumulate individually, writing that back to global memory, and then one kernel which sums over that little piece of global memory.
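Schematically, the two-stage reduce looks like this (illustrative numpy standing in for the two generated kernels):

```python
import numpy as np

x = np.random.rand(16 * 1024 * 1024).astype(np.float32)
n_groups = 256  # e.g. one partial sum per workgroup on a GPU

# kernel 1: each "core" reduces its own contiguous slice to one partial
partials = x.reshape(n_groups, -1).sum(axis=1)   # shape (256,)

# kernel 2: a single small reduction over the partials
total = partials.sum()

assert np.isclose(total, x.sum(), rtol=1e-3)
```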
That's kernel three right there. So we can try disabling that and see if things get faster. Let's also do that sum in numpy and confirm we're getting the same value. Seems pretty close, cool. Let's just make sure we don't actually break the tinygrad code. I'm also curious how fast numpy is doing it — that's not right... numpy is doing it three times faster. How does that compare to torch? This is 496... 496. Oh okay, when you BEAM it, we are faster than numpy. We're actually crushing numpy. Wow, that's cool, guys, we're crushing numpy; it's just torch that's beating us. And no, numpy does not get faster regardless; I don't really care about numpy speed, we can just get rid of that. MLX is slower than tinygrad — wow, what a shocker. The thing I care about is this number here, this 44 GB per second. Torch is getting 71. In theory we should be able to get 400. Why is that number so far off? How do we do with clang? Clang is basically getting the same thing, and it's easier for me to think in clang than in LLVM IR. Let's take a look at our code. I have no idea why that's not fast — oh no, maybe I do: it's accessing things here when it kind of shouldn't be; those don't have to be accumulated like that. We can also see the assembly that's being generated with DEBUG=6. What does this do, and why is it slow?

Whenever I use ChatGPT I always do check-your-understanding questions; it's wrong super often. Most people may not notice how often LLMs are wrong, because they speak with this very authoritative voice. I was asking it this morning about AVX-512 and Zen 4, and it was swearing that Zen 4 didn't support AVX-512, which is half true, half not true. I tweeted about it, but what it said was wrong. What the code does: "loads data from memory into NEON registers", okay; "it's performing the—" — this is why it's slow: it's having to do these shuffles, which it shouldn't have to do. Let's also disable that reduce-splitting logic. Where is the reduce-splitting logic now? Maybe it's in ops... it's still here, it's called split_reduceop. That's why it's generating the two kernels. I'll show you quickly: you can see it's much slower on Metal to not do that split reduce. Look at the total difference here; it's a six-millisecond difference. I'm also going to add in this — so there are two kernels, and you can see it's 0.27; it just resets and ignores that first one. We could do that with the rands as well, but it still wastes some time.
Let's see if we can BEAM to make it faster. That made it a little faster, but still nowhere near the original speed. There's some exiting issue — why is that happening? That's also annoying; move that to DEBUG=3, push that later, because they're all opening LLVM. All the children are opening LLVM. They kind of have to... actually they don't; they're children, they don't. But regardless: there's so much state in my head, so much different tinygrad state. This is my failing as a manager of the project. Your job as a manager is to unblock other people, to figure out how to show other people a path by which they can contribute to the larger goal. There's also an aspect of motivating people — maybe motivating is the wrong word: being clear about the larger goal, and then showing people how they can make progress toward it. I have to read more books. Jim Keller quote: most people who go into management will not read any books about the topic, or maybe they'll read one book. I read ten books.

I think the split-reduceop difference is going to be notably smaller with LLVM. So it's still a little bit slower: we're only getting 33 there versus 47, so the split reduceop is still desirable. We're going to do this with clang, where I can actually read it. We have to avoid all that shuffling, and I think there's a way to do that. The syntax is annoying; it used to be much prettier, how this would print. It's so robust now I almost want to remove it. This is at DEBUG=3 — this output is garbage, it should be removed. What did it used to print? It used to print the lazy buffers, which were simpler to look at. Okay, that's going to be a problem for later; I want to look at the assembly. This isn't doing any more shuffling, but I see why it's slow: it isn't doing any speed either. That generates the shuffle. Let's see if we can manually find things that are fast. Don't worry about optimizing that; we just want to get this number up. So we have the schedule — yeah, that's going to fail now: can't copy out on an unallocated buffer. But we can actually run the schedule items. To show you what this is: these are the schedule items, and we have to lower a schedule item. There should be an easier way to express all of this; there should be an easier way, because it's a common operation where all you want to do is change the default optimizations. There we go, it works again. There's complex logic here that needs to be removed. Where are those optimizations actually applied? If I start to forget how this works, it's hopeless — I mean, other people can figure it out, but this stuff should all just be way simpler. I'll add a TODO: make it easy to alter the opts for a schedule item. Well, I'll show you something we can do. We have a CompiledRunner; we can create one of those. Go look at the ExecItem. How do I generate a CompiledRunner? I can get the ProgramSpec back from the compiled one. For now, just trivially shove that back in and confirm that it still works. Still works, okay. The difference is, now we can replace the source code — yay. Ah, these things are all dataclasses, which is really nice. And this is still the multistage reduce; I don't want it to be multistage, let's make it a single thing. How do I do that? This stuff's actually pretty easy: you can just do it with a context,
setting SPLIT_REDUCEOP=0. Okay, now we're back to one kernel. The rtol is a little bit off, but that's still very much correct. Okay, great. We're printing the source now, and we can put the source here. I don't know why that doesn't — why there's no... oh, I turned NOOPT on; let's turn NOOPT off. It doesn't matter; we're replacing the source here: prog_spec = replace(prog_spec, source=reduce_source). So now if I add a typo in here, you can see that it crashes — that's pretty good. You have to know a little about the internal structures to do that surgery, but it's overall pretty easy. So let's just write something simple like this. Let me change the name too; let's call it reduce. data1 is 16 million inputs. If we want, we could even load this in. I wish there was a way to tell VS Code what language this is. Oh, I broke it — or I didn't; seems right now. Okay, so if I nerf it and make it only go to 15, you see that I broke the code and get the wrong answer. Great. So let's look at the one that's optimized a bit — oh, I have to typedef float4 too — and let's look at the assembly that's being generated. Okay, so we're loading... I believe that's post-increment-style logic; no, that doesn't actually do anything, I don't know why it's there. Oh, we load twice because it apparently implicitly unrolled the loop. Here we're getting 11 gigabytes per second, and torch is getting 71, so yeah, that's not a lot of gigabytes per second. An obvious change I wanted to make is this, and then we can add them up later. Let's also just ask ChatGPT how to make this code faster. How does that look? Semicolons — spend too much time coding in Python and you forget to put semicolons at the end of everything. Let's see, an initializer for this... okay, 27 gigabytes per second, that's an improvement — sorry, 23 gigabytes per second. That's actually printing the wrong thing... oh, it's recompiling down here, I see; you can see it recompile down here and run here. It's still the right answer. ldr q1 — pretty sure that's 128 bits across all four lanes. Oh wait, sorry, that's the loop right there. Okay, that's stupid, that barely even runs — that's the loop, and for some reason the loop isn't fast. My next thought would just be to unroll the loop, because all the time in this thing has to be spent in that tight assembly, right? Let's just try something like this. 65! All right, that's better; that's more torch-like. That's the code I want tinygrad to generate; I don't know why it can't just generate it. Also, is this doing an fadd stupidly? I like that it's combining the things. Let's try this. Okay, that's pretty good at 68; I'm pretty happy with that, but we need to go faster. Again, the theoretical limit of this is 400, and I don't know why we're not getting it, because our GFLOPS are not — it's not a lot of GFLOPS, that's minimal GFLOPS. Ah, I think this has to do with sequential versus nonsequential access. I was thinking of Sabrina Carpenter, "A Nonsense Christmas", you know? Okay, well, that's fast but also wrong. Oh, yeah, because we can't do that. Wait, what? Huh, that's right but slower. Okay, whatever, we don't want to do that.
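To follow along without tinygrad's internals, here's a self-contained sketch of the same experiment: a hand-written C reduce kernel, compiled with clang and timed from Python via ctypes. The 16M element count and the float4 unrolling mirror what's on screen; the function name, file handling, and flags are my own scaffolding.

```python
import ctypes, os, subprocess, tempfile, time
import numpy as np

# a hand-unrolled float4 reduce, in the spirit of the kernel on screen:
# vector loads, four independent accumulator chains, horizontal sum at the end
C_SRC = r"""
typedef float float4 __attribute__((ext_vector_type(4)));
float reduce(float* data1) {
  float4 acc0 = {0,0,0,0}, acc1 = {0,0,0,0};
  float4 acc2 = {0,0,0,0}, acc3 = {0,0,0,0};
  for (int i = 0; i < 16777216; i += 16) {
    acc0 += *(float4*)(data1 + i + 0);
    acc1 += *(float4*)(data1 + i + 4);
    acc2 += *(float4*)(data1 + i + 8);
    acc3 += *(float4*)(data1 + i + 12);
  }
  float4 t = acc0 + acc1 + acc2 + acc3;   // outside the hot loop
  return t.x + t.y + t.z + t.w;
}
"""

with tempfile.TemporaryDirectory() as d:
    src, lib = os.path.join(d, "reduce.c"), os.path.join(d, "reduce.so")
    open(src, "w").write(C_SRC)
    subprocess.check_call(["clang", "-O2", "-shared", "-o", lib, src])
    fxn = ctypes.CDLL(lib).reduce
    fxn.restype = ctypes.c_float

    data = np.random.rand(16777216).astype(np.float32)
    ptr = data.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
    st = time.perf_counter()
    out = fxn(ptr)
    et = time.perf_counter() - st
    print(f"{out:.1f} (numpy says {data.sum():.1f}), "
          f"{data.nbytes / et / 1e9:.2f} GB/s")
```

The independent accumulator chains are the point: with a single float4 accumulator, every fadd waits on the previous one's latency, which is the same dependency-chain problem that comes up again below.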
Let's also make another change here: these shifts are kind of stupid. The old one would just do this and then something like +16 — get rid of these shifts. I'm sure LLVM is optimizing them out anyway, but that should be correct, if I get my count of stupid parentheses right. Those are just nonsense parentheses, just like "A Nonsense Christmas" by Sabrina Carpenter — they were playing it in the coffee shop this morning. Okay, good, that's a good amount of nonsense. All right, 71 gigabytes per second! That's a lot of gigabytes; that's even more gigabytes than before. Well, that run's only 66 gigabytes — okay, that's not too many gigabytes. 69, 72, 71 — don't tell me it's the same RAM crap as yesterday. Okay, let's try something else: what if I create two accumulators and do something like this? It doesn't really matter where these things get added; this stuff at the end is all irrelevant because it's outside the loop. 74! Okay, do we match torch yet? Oh, we're beating torch. Sometimes you want to avoid lane conflicts, but — I see, yeah, there's a dependency chain here which we don't want; you never want that. So you know what, we can go even crazier with this. Now we're doing the reduction — 74, that's pretty good; 72, 65, 64, 73. Okay, it's already coalescing those into one read; this is effectively the same thing as one float16 load, right? Hold on, let's just try more. You probably want to use all the registers, you know what I'm saying? I'm not going to start saying "you know what I'm saying" — there are too many "you know what I'm saying"s. Oh, be fast, be fast, be fast... 75. Okay, but this still isn't getting it; that code looks very fast to me now. Just look at this. Let's ask ChatGPT how to make it faster. How do I make this faster? Let's eat this pill. Let's put some of these flags in; maybe that'll make it faster. I know fast math breaks some stuff. Same code... oh, that's the same code because of the compiler cache; let me disable it. Okay, well, that didn't make anything faster. "Use simple loops that the compiler can easily recognize." Here: simple loops the compiler can easily recognize — I'm sure that +0 is fine. Oh yeah, "simple loops the compiler can easily recognize", my ass. Why is this not fast? I put in -ffast-math and -funroll-loops; I'm not stupid, ChatGPT. GPT sucks — I thought you were going to be the top competitive programmer in the world. Why does this thing not optimize my reduce? Oh, maybe it doesn't know that it's ARM. "A common optimization is to maintain multiple accumulators in parallel, then do a final horizontal sum, but letting the compiler do it automatically—" What? No, I tried letting the compiler do it automatically, and it got me that dogshit time. Okay, we need to go faster. Why can this be slow? I don't understand; is it just slow reading the RAM? Do I have a way to just shove assembly code in there? I do. We're going to go to the assembly, boys. One of the awesome things about tinygrad now is that the code that's running here is no longer anything fancy — it's just assembly. We have one more pill; let's eat this pill. We've finished the pills. Is Capstone only a disassembler? Oh, that's ChatGPT... that's not what I want. I'm getting rid of DuckDuckGo search. That's it, we're done with DuckDuckGo search.
Chrome settings, search engine: name Google, shortcut google.com, add Google to Chromium. There we go. What? How come I can't edit that? Why can't I edit that? I can change those things, but I can't change that. Okay, let's use Bing. Anyone in chat know why I can't change that? Oh, should I try -O3? You want -O3? Okay, fine, we'll try -O3 — it did say -O3. I doubt it's going to help, but we can always give it a try. Okay, "this is a simple loop the compiler can recognize"... same garbage; -O3 did nothing. I actually remember -O3 being worse on some of the things I've tested. We're going to the assembly, boys; it's the only place to go. Let's use Bing, I hear it's a good search engine: "capstone assembler python". I think it can assemble as well. My eyes hurt — am I the only one who feels this way? "Explore 60 inspiring Capstone project ideas", "how many Capstone project examples are there? Below there are 150" — the internet is for idiots. Okay, Capstone is only a disassembler. Unicorn, ARM code... oh, Keystone! That's the one I'm thinking of. Yeah, we got that Keystone; y'all got Keystone, you know, got some Keystone, some Milwaukee's Best, some PBR. Going to the assembly. All right, no, I've got to stop. How do I delete just that? Let's use ChatGPT and have it delete up to the colon on each line. "George, don't you know you can just do this?" All right, human versus AI, let's go. All right, now how many lines did it accidentally delete? Oh, by the way, I won. Where was that? It's here. No, no — Keystone. What do I brew install? OpenSSL — do I have that? How do I not have OpenSSL? I already have OpenSSL. No dependencies should be the law. Maybe I should just use inline assembly, just YOLO it. Come on, ChatGPT, don't let me down — don't say it, don't say it, but I'm thinking it. All right, we built it; that worked, thanks ChatGPT. I used to waste so much time on crap like that; it's just kind of fast now, which is nice. Extra reduce speed, great. "cannot import name Ks from keystone" — why is this a liar? Oh, keystone-engine; I didn't even get the right Keystone. pip3 uninstall keystone — and now my machine's probably poisoned. What is that keystone? Oh, it's the OpenStack identity thing. Great, we had that on my computer for a little bit. Now we have keystone-engine. "Provider must have explicit—" what? Huh? Because I imported the wrong Keystone. Oh, clang — you've got to use clang. Are we assembling this? Let's assemble something. Where did that example go? Create the assembler, and now we assemble into ARM byte code. "Invalid mnemonic", okay. ret — well, if ret doesn't work, then... oh, maybe because this is included here; I don't know how Python works with that. What, you don't like ret? It's because there are too many newlines. Come on, ret is a real ARM thing, right? mov r11 — does that work? Okay, well, you can mov r11 but you can't ret? That makes no sense. Can I mov r11 here? Yeah, I can. Of course that doesn't work — oh, because it's ARM64; r11 is not a thing in ARM64. "Invalid mode" — is there a mode for ARM64? KS_MODE_64, maybe? Okay, invalid mode; let's read the Keystone code. Oh God. Oh: use KS_MODE_LITTLE_ENDIAN for ARM64. Thanks, ChatGPT. Great, let's see if I can compile with all this now.
Great, thanks ChatGPT. I love ChatGPT; ChatGPT is my new best friend. All right, we got a lot of ARM byte code. So now, when we look inside the ExecItem object, we'll find a CompiledRunner, and we want to dive into that CompiledRunner and change the lib — why is that that color? I don't like that; I don't trust it — change the source code, change the assembly.
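For reference, the keystone-engine API that finally worked, as a minimal sketch (the example instructions are mine; the arch and mode constants are the real ones for ARM64):

```python
from keystone import Ks, KS_ARCH_ARM64, KS_MODE_LITTLE_ENDIAN

# ARM64 wants KS_ARCH_ARM64 with KS_MODE_LITTLE_ENDIAN
# (KS_MODE_ARM is the 32-bit mode, where "mov r11, r12" assembles
#  but the x/v registers and ret don't exist)
ks = Ks(KS_ARCH_ARM64, KS_MODE_LITTLE_ENDIAN)

encoding, count = ks.asm("fadd v0.4s, v0.4s, v1.4s\nret")
machine_code = bytes(encoding)    # .asm returns a list of ints, not bytes
print(count, machine_code.hex())  # 2 instructions, 8 bytes
```

The bytes() conversion matters — the "a bytes-like object is required, not 'list'" error that comes up next is exactly what you hit if you shove the raw list in.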

"It's pronounced lib, why do you keep saying libe?" We should just be able to shove the ARM byte code in there; let's see if it works. "CompiledRunner object has no attribute 'name'" — oh, this part's back; I really don't know why it's that color. I set prg somewhere else... no. "A bytes-like object is required, not 'list'" — okay, well, I don't know why that's a list; I guess I didn't check it. There's no bytes() on the byte code — it's bytes, not a list. Okay, great. Let's see if it's working by messing with one of the numbers and seeing if it breaks. Let's put a one here... yeah. Who can write assembly code? Nobody. 30 — that's got to break some stuff. It didn't break anything? I'm skeptical; I'm very skeptical. All right, let's see what's going on here. It should work... in case it's getting freed? I don't understand. CompiledRunner definitely calls the lib, right? Oh no, it calls .prg; it doesn't call the lib. What am I doing? All right, now let's try to change that to 30 and see if it breaks. It definitely calls the prg. "Object is not callable" — okay, is that actually fine there? Okay, great, it seems to be working now. I'm not exactly sure why loading from 30 — oh, I know why it works. If we load from 20 it won't work... I take it back, it works. It works pretty much regardless of what I change. I don't understand this. If I don't loop, okay, then it's broken — but how come I can change these and it's fine? That doesn't make any sense. That's very concerning. Okay, let's change this to randn and hope it's centered around zero — or, what can I do... random is between zero and one, so if I just subtract, it's going to be even slower. It is slow, but whatever. Okay, fine. Wow, these things are really sensitive, actually. Okay, cool, seems like it does something. I'm getting rid of that change; it did nothing. This is on GitHub on the reduce-speed branch, if anyone wants to play with it. Okay, so what if I read from, like, 80? All right, these things do break things eventually; that one's broken too. Floating-point addition is terrible: it's not associative — (a + b) + c does not equal a + (b + c) — and I'm sure it's not transitive either; I forget what transitive means, but I'm sure it's not that. My point is, it sucks. Floating-point math is terrible.

Okay, wait, I don't know why we're only getting whatever we're getting — numbers in the 60s and 70s, numbers that are similar to torch's. Torch also might be allocating on a boundary; oh, this might just have to do with TLB misses. That thing yesterday might have to do with the TLB — you guys know what a TLB is? So let's go simple with this. I'm going to make stuff that doesn't work but should basically do the same thing — only the memory access. Let's get rid of all the stuff that's not the memory access. Probably quicker if I copy and paste the whole thing. So, is this adr actually needed? I don't think it's actually part of it; delete that. We should be stripping that. If you know how to fix that — I want minimal shellcode. I love that tinygrad is just shellcode. All right, what is this needed for? Why is this code here? Oh, that's the compare for the loop. Why can't it use this one? Oh, I guess it kind of makes sense.
Whatever. Okay, we're only getting 72 gigabytes per second of read, and we're not limited at all — I took the math out, and that's just entirely the read speed. Just for fun, let's see if this becomes reliable if I align it to a page boundary. It's a lot more reliable. All right, so I want to fix the malloc allocator to align to page boundaries whenever things are big. Actually, it's a one-liner — but no, this isn't the right fix. Whatever; okay, well, that makes things faster, or at least more consistent. We're in the one-millisecond timing range, so the CPU should have no problem reliably timing this. I don't know what it's doing now, but why was that one 150-something? I don't know, whatever. Okay, so just doing the memory accesses alone, I'm only getting 76 — this is only accessing memory at 76 gigabytes per second, which, I'll note, is also torch's speed. The reason our speed doesn't match torch is very fixable now. I'll throw a $300 bounty on it if someone wants to actually do it: find out how to modify the OptOps in order to do that correct swizzle, so we get the same speed as torch. I still don't understand why torch itself is that slow, and I think we're going to spend the rest of the stream on that.

With a lot of these bounties, I'm capable of doing the bounty, but my goal is not to do the bounty. My goal as a manager of the project is to enable others and bring people in so that they can do the bounty. Me actually doing it misses the point. If tinygrad is going to succeed as a project, there's no way for me to code the whole thing; it's too much work, and there are too many of these things to do. We have a pretty great team so far — we're five people — and these people have done a ton of stuff independently. You can look and see who they are and what they've done on GitHub, and it's amazing. This is how you scale a project like this, so finding people like this is super important: people who can independently look at this stuff and contribute value. So I'm trying to make bounties that are accessible. I'll throw a $300 bounty on that — actually, I'll do this right now; let me write out the whole line. We had some bounties for this stuff way back in the day, and they worked pretty well. Green or yellow on Mac? Okay, so here: "see stream for how to do it". The problem is you need to swizzle and add them at the end. I think that's just an OptOp; this is probably a two-line change, might even be a one-line change, that'll get you those $300.

How do I access memory faster? Multiple threads, unroll your loop, okay... ooh, prefetch, that's getting fancy. I should have 400 GB per second of bandwidth; how to go faster on an Apple M3? Prefetch, interesting. All right, let's try that; it can go there. See if that changes anything... no, same. And let's unroll the loop — that didn't work. Change that one too; 79, it's the same, that didn't make it faster. Print the address of the buffer — how do I get a pointer to that? ctypes addressof. No, I don't want multiple threads. Let's try one of those; get rid of the thing which didn't help... doesn't help. Okay, I mean, maybe it just doesn't go faster; it's pretty much the same speed as torch. How is this one faster, though? That's somehow faster; how is it doing that?
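On the page-alignment experiment: a quick way to see the effect from Python is to compare a plain malloc-backed numpy buffer against one backed by mmap, which always starts on a page boundary (my sketch; the 16M size matches the stream, everything else is illustrative):

```python
import mmap
import numpy as np

N = 16 * 1024 * 1024

# np.empty uses malloc under the hood; for big blocks it's usually
# well aligned but not guaranteed to sit on a page boundary
a = np.empty(N, dtype=np.float32)
print(hex(a.ctypes.data), a.ctypes.data % 16384)  # M-series pages are 16 KiB

# an anonymous mmap'd buffer always starts on a page boundary
buf = mmap.mmap(-1, N * 4)
b = np.frombuffer(buf, dtype=np.float32)
print(hex(b.ctypes.data), b.ctypes.data % 16384)  # always 0
```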
Okay, so here's an idea. Right now this is doing all the reads sequentially. What if I kick off four different streams, you know what I'm saying? So instead of incrementing this by 32, let's increment by 16, divide that by two, and then add that offset to those. A lot slower, huh? Why — wait, no, wait: 83. Oh, that's faster! Oh, sick, great. All right, let's make something called data1_high and say data1 plus that (there's a sketch of this two-stream idea below). Oh yeah, now we're talking; we're in the 80s now. Obviously it's working. Why does it sometimes get me in the 80s and sometimes only in the 40s? But okay, that's good progress. Let's try four of these. Try reordering them too — well, now we might run into register pressure issues, actually, if we do this. And then we just need to figure out why tinygrad doesn't have these tricks in the search proper. Four — okay, let's go. All right, I still don't see anything faster than 85; it's only going to give me that. Then let's look at the assembly for this; let's put this in the assembler. It's just annoying — there's a way to do this in Vim that's fast: remove up to the colon on each line. Glad you reasoned for a couple of seconds about it before doing it. There was a meme on Twitter about using LLMs as data parsers, and you actually totally shouldn't do this, because at some point these things just start to subtly get things wrong. What you really want the thing to do is generate a program that does the parse, show the user the program, and then run the program, because a short program is designed to be right. A lot of these things... we'll get there; that's where these things are going to go.

Okay, cool, so it works in assembly. Let's get rid of all of the fadds — wait, no, this can't work in assembly. What? How does it even compile? How does that work? You can't just do that. Okay, I'm not going to ask too many questions; I'm just going to fix it. Oh, I guess it's fine; I guess it just numbers them correctly. I see how that works, never mind. All right, so it's getting 86. Let's delete the fadds. Whoa, it's doing a ton of that — okay, that's kind of fine, I guess. 100! Whoa. Okay, so without the fadds we can hit 100 sometimes, if we get lucky; we'll get to the "sometimes" problem later. Okay, 82. Then what might make this faster is having more registers, more of these guys — any time you can make things independent of each other, it's generally faster. 108! Yo, we're leaving torch in the dust, boys. 108, cool. Yeah, that's sick; that's as fast as it was without the — yeah, I don't know, it's going to be hard to beat. I think it's a pretty good stream: 108, it's about 100, fast like a NASCAR.

All right, now make tinygrad generate that code. I'll do it next stream if nobody does it, but I want somebody to take the time to look at this and figure out: oh cool, the optimizations he was doing by hand in the C actually correspond to this other thing in tinygrad, and we just need to add the other thing to the search space. It corresponds to these OptOps — the kernels are alterable by choosing the right set of OptOps. So the next steps here are to go take OptOps and find a set of OptOps that has the codegen generate this.
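For reference, the data1_high hand edit in schematic form (my reconstruction of the idea, not the exact on-screen source): split the buffer into two halves and read both streams in the same loop, so the core has two independent address streams in flight.

```python
# schematic C for the two-stream read, kept as a source string the way
# the stream patches it into the CompiledRunner (names are illustrative)
reduce_two_streams = r"""
typedef float float4 __attribute__((ext_vector_type(4)));
float reduce(float* data1) {
  float4 acc0 = {0,0,0,0}, acc1 = {0,0,0,0};
  float* data1_high = data1 + 16777216 / 2;   // second stream: upper half
  for (int i = 0; i < 16777216 / 2; i += 4) {
    acc0 += *(float4*)(data1 + i);            // stream 1: lower half
    acc1 += *(float4*)(data1_high + i);       // stream 2: upper half
  }
  float4 t = acc0 + acc1;
  return t.x + t.y + t.z + t.w;
}
"""
```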
First you want to do that manually, then you want to go in and modify the search to have the search find it for you. I believe in one of you to be able to do that.

Seven minutes with chat, and then I've got things to do today. Someone's going to say "if you static or register the C variables" — they already are registers; I'm looking at the assembly, so I don't think that's going to help too much. Non-subs, if you want to talk, now's your moment. I'm glad you guys like Amanda; she's very nice. "The question of when true artificial general intelligence might arrive has been circulating in AI circles for decades, and it's never been more relevant than it is today" — so, Tandyboy, you should read this: there's no consensus, but there's healthy skepticism and open inquiry. I read a thing about how ChatGPT makes people less politically polarized, and I largely agree. Do you like DeepSeek? Let's say I am dishing out $200 a month just so I can get unlimited o1 — he's not familiar with DeepSeek, unfortunately. You're going to have to clarify, DJ cows. Is AMD going to sue you for licensing issues? Let's say I'm not aware of any reason AMD would sue me. Is o3 available? No, actually, there's no o3 yet; I have o3-mini and o3-mini-high, but I don't have o3. Where to start if you want to help with tinygrad? "Why don't you explore the repository, set up your environment, familiarize yourself with the code, run the tests and examples" — I'm actually not sure if that works; you might have to do pytest test — "pick an issue or feature to work on", yes, that's great, "then contribute your changes", and then you can "engage with the community". Wow. Would you rather o1 or DeepSeek in a humanoid robot body? I mean, neither of them can control the body, you know; they're all kind of like Stephen Hawking. Any quick advice to a CS dropout? I hope this isn't polluting my ChatGPT. "Start building right away; you'll learn a lot faster by actually making things" — wow, that's really true. "Real user feedback will show you what's valuable. Focus on solving a problem. Master the basics of business and marketing" — good points. "Learn continuously, iterate rapidly and ship often, network and find support." Unfortunately, ChatGPT does have this PMC way of talking. I would never say something lame like that, but you should definitely understand these ideas: if you don't understand how to get people to use your shit, whether your shit's useful, whether your shit costs a good amount, and whether people are excited about your shit — yeah, okay, that's important, but that's how I would say it. All right, "Sam Altman says ChatGPT will be smarter than him" — more hype-man nonsense. Let's see, what does smarter really mean? "Pseudocode — do you mean Python?" I love this. I love copying and pasting your chats into ChatGPT and then just reading what ChatGPT says, because ChatGPT is far more patient with you than I would ever be, I'm sorry. Yes, to answer your question, if AGI is already here, well — can you multiply a matrix? Let's see, here we go. And now it becomes so clear which of these questions are trolls. Thoughts on AI integration in your editor? Yeah, I tried Cursor and stuff; I don't think it's that useful. If you see how I've used AI so far on this stream, I find it very bad at — you see, I've copy and pasted almost nothing except for the

2025-02-13 02:28
