Jan and Daniel, it's amazing to have you on MLST. Welcome, both of you.

Thank you for having us here.

So you guys are the official winners of the ARC challenge; the results were finalized, I guess, about a month ago or so.

Mm-hmm, yeah.

Why don't we just start by you telling us a little bit about your solution?

It was a process with lots of ups and downs. We started with a simple LLM fine-tuning and had pretty good results, but during the competition we tried to add additional computation steps outside of the LLM, which helped us to increase our score substantially. Our LLM alone reached a score of 41 points, I think, but with the additional external help we could get the LLM to improve substantially, to about 56 points; that run didn't count, though, so 53.5 points in the end.

So can you talk us through the overall architecture?

Yeah, we actually started with a basic LLM that was trained on language. We started out with a rather big model, a 12-billion-parameter model. What we essentially do is take the ARC tasks, which are essentially located on a grid with pixels of different colors, and we just have one token for each color; we tokenize them line by line and then put them directly into an LLM. We do no program search or other pre-processing; we just convert them to text and put them into the LLM. That worked amazingly well, better than I expected. Of course, there are quite a few tricks we need to make it work better on these tasks, because the reasoning capabilities of LLMs without fine-tuning on the specific test are not so good. So we essentially did some test-time fine-tuning; we had two training processes. We first did a long pre-training on the official ARC training set, which we later replaced with RE-ARC, which can generate more examples for each task, and this model did not perform so well on the evaluation tasks. So what we do is test-time training: we do another training process on the example pairs of the evaluation set, which we get during inference, without the final challenge output that we then try to predict with the model, and that gives a big improvement to the score. There were quite a lot of other improvements that we tried during the challenge; some worked very well, but there's also a lot that didn't really turn out well.

Tufa Labs is a new AI research lab I'm starting in Zurich. It is funded from past ventures involving AI as well, and so we are a Swiss version of DeepSeek: a small group of people, very motivated, very hard-working, and we try to do some AGI research, starting with LLMs and o1-style models. What we're looking for now is a Chief Scientist and also research engineers. You can check out the positions at tufalabs.ai.
MLST is sponsored by CentML, which is the compute platform specifically optimized for AI workloads. They support all of the latest open-source language models out of the box, like Llama, for example. You can pay on consumption, essentially, where you can have a model which is always working, or it can be freeze-dried when you're not using it. All of the models that they deploy support the OpenAI API specification out of the box, which means it's just a one-line change in your application to switch over to CentML and start saving money and making your application go faster.

Yeah, it must have been a fascinating process, trying all of these different ideas and seeing what stuck and what didn't. I've updated so much. I mean, Melanie Mitchell had quite an interesting paper out talking about her ConceptARC and what it means for LLMs to understand, and all of this, and the common-sense rationale was very much that language models will never be able to do this, and the only way it could possibly work is if we generate an intermediate program, because symbolic things are better: they're more compositional, they generalize better. And certainly, I think if you took GPT-4 and just did zero-shot solution-space prediction, you were looking at around 10%, which is non-zero; it's certainly much better than something like GPT-3.5. But many of us, myself included, just ruled out the possibility that these things could be reasoning. What did you discover on that journey?

We started, or I at least started, with the same assumption, but we quickly found that the LLMs have far more computational capability than we thought. For example, all the tasks seem like 2D vision tasks, right? They are on a 2D grid, and if you see them, you use your perceptual knowledge to move blocks or move pixels around. My heuristic was that an LLM is 1D: it sees text, and we have a newline token to go to the next line, and it seemed crazy to me that the LLM just functions in that space, that it has to learn to infer the 2D structure of the problem without ever working in 2D. But the more we trained the networks, the more it seemed like that just wasn't a problem; the LLMs were strong enough, had enough capability, to infer this structure implicitly. We did some experiments where we tried to help them by giving them more information about the structure, like appending dimensions or coordinates, but it wasn't needed. It rarely improved the score, and when it did improve the score, it did so by a very small amount that wasn't worth the additional compute time, basically.

So I think one of the intuitions there is that it's a 2D grid; I mean, certainly in Transformer models we put in positional tokens and things like that. So the remarkable thing is that it generalized to different grid sizes, and you're saying without you needing to explicitly give it any compasses or markers or anything like that.

We actually tried a 2D positional encoding on the data, but it didn't really improve the results. It just learns to detect the end-of-line markers and to output lines with the correct length; it just worked fine (a minimal sketch of this line-by-line serialization is shown below). We then stopped with the positional-encoding approach because it was too complicated and didn't give us any advantage.

You did fine-tune the model. I think you ended up using, was it a Llama 8 billion, is that what you were using?

That's what we actually started with.
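To make the representation concrete, here is a minimal sketch of the kind of line-by-line serialization described above. It is an illustration only: the exact separator tokens, the "I"/"O" markers, and the prompt layout of the winning solution are assumptions, not the actual implementation.

```python
# Sketch of serializing an ARC grid for an LLM: one symbol per color (digits
# 0-9), rows separated by a newline, no coordinates or other 2D hints.
# The "I"/"O" markers and "\n" separators are illustrative assumptions.

def grid_to_text(grid):
    """Serialize a grid (list of rows of ints 0-9) row by row."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def task_to_prompt(train_pairs, test_input):
    """Concatenate the demonstration pairs followed by the test input."""
    parts = []
    for inp, out in train_pairs:
        parts.append("I\n" + grid_to_text(inp) + "\nO\n" + grid_to_text(out))
    parts.append("I\n" + grid_to_text(test_input) + "\nO\n")
    return "\n".join(parts)

# Example: a 2x3 grid becomes "120\n003"
print(grid_to_text([[1, 2, 0], [0, 0, 3]]))
```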
Then we went to a 12-billion model, and then, because this model was too big to do many things on Kaggle, we actually went back to a smaller model, the Llama 3.2 3B model, which
we found to be essentially as strong as the 8B model but much faster in execution, so we could try a lot more things.

Yeah, I'm very excited about that new Llama model; it's amazing how much they've gotten out of it. So just with that model out of the box, zero shot in solution space, what kind of accuracy are we talking about?

Without training?

Without any training.

Without any training, I think close to zero, or zero.

Interesting.

The model can't even predict the structure, really. Sometimes it just outputs end-of-sentence tokens indefinitely; sometimes it predicts some numbers, as it should, with end-of-line tokens, but then some lines are too long and some too short. It hardly produces any solution that is even possible without any training process.

Very interesting. So you trained it on RE-ARC, which is Michael Hodel's data set; maybe talk a little bit about that.

It was very useful to have a data set that is basically unlimited. We have the 400 tasks in the training set, and RE-ARC helps to generate more examples for each challenge. In a way, at some point you saturate, because you don't get novel ideas in these tasks, but it was very useful to have virtually infinite generable training data.

Very cool. So Michael Hodel's thing is a data set generator, so given a task you can create a whole bunch of additional augmentations of that task. So you had about 10,000 of those across all of the ARC tasks in the public set, and you fine-tuned, let's say, Llama 3.2 3B on that. And then, zero shot in solution space, out of the gate, what are we looking at?

Oh, let me think for a moment. With the Llama model I think we were maybe at about 10 to 20% performance, I think.

Interesting.

I'm not sure, because we did multiple inference steps and so on, so you could get the zero-shot model with some tricks to maybe 20 or 25%; that was the maximum we could get out, I think, on the eval data.

Yeah, on the eval data set. Well, that brings us to the magical test-time inference, or test-time training; there are lots of different names for it, and people have been using different names for it. But tell us about that.

Yeah, what we actually did was, after the model was fine-tuned the first time, we did an additional training step on the evaluation data. Not on the full evaluation data, because we only had the examples and not the challenge itself, but the examples essentially look the same as the challenge, so we could just pick one output from the examples as the final example and train on those inputs (a rough sketch of this leave-one-out construction appears below). In the Kaggle contest we essentially did a retraining on some of these examples, and after that we did the inference step, and this improved the results quite a lot.

I suppose you could argue, I don't know whether you would call that cheating or not, but in the real world I suppose you would be doing online prediction, and what you're doing is making your predictive model as good as it would have been at the end of a run of online prediction.

You're correct with that, yeah. But we tested different versions, and it's also possible to essentially retrain on every single task, which would be a fairer version, because the model doesn't see the other evaluation tasks. That works just as well, but in our experiments it required more time to retrain than just putting in all the tasks at the same time, and because time was so limited in this challenge, we chose the option to train on multiple tasks at once.
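As a rough sketch of the test-time training setup just described: each demonstration pair takes a turn as the output to predict, with the remaining pairs serving as in-context examples. This assumes the hypothetical `task_to_prompt`/`grid_to_text` helpers from the earlier serialization sketch and is not the exact data pipeline of the winning solution.

```python
# Build test-time training examples from a task's demonstration pairs by
# holding one pair out as the target ("leave one out"), as described above.

def leave_one_out_examples(train_pairs):
    """Yield (prompt, target) text pairs for test-time fine-tuning."""
    examples = []
    for i, (held_in, held_out) in enumerate(train_pairs):
        context = [p for j, p in enumerate(train_pairs) if j != i]
        prompt = task_to_prompt(context, held_in)   # remaining pairs as context
        target = grid_to_text(held_out)             # held-out output to predict
        examples.append((prompt, target))
    return examples

# These (prompt, target) strings would then be tokenized and used for a short
# fine-tuning pass before predicting the real test output.
```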
Yeah, so there's this thing, transductive active fine-tuning, which is when, at test time, you have the specification, you generate augmentations from the specification, you fine-tune on those augmentations, and then you do the prediction.

Do you mean from the test, from the new challenges?

Yeah, so let's say you get three new specifications for a task; do you augment that and fine-tune on it?

Yeah, we heavily use augmentation in basically all our inference steps and also in training; we also use it for the pre-training, actually. It helps a lot to get enough data. During the retraining, or test-time training, we are heavily data-limited, right, because each challenge only has three examples. But we can use the symmetry of the problem: all the problems work on a 2D grid, and there are some implicit biases in it, so we can rotate it, we can mirror it, we can even shift the colors in specific ways without changing the challenge. And since the model is not rotation-invariant, because it's a 1D LLM, we can create the rotated augmentations and use them to train on the challenge too. It looks like a novel task for the LLM because it's rotated differently, which helps a lot with the amount of data we have.

So from the test specifications you do symmetry augmentations and fine-tuning. Do you do any additional augmentations?

We were thinking about a lot of augmentation ideas; most of our tests didn't pan out or were too time-intensive. For example, we have a selection step where we take the generated example, our solution, and then we do an augmentation again: we ask the LLM, basically, does this solution look correct from a different perspective too? Theoretically, you can do more augmentations there: you can shift into another space where the problem looks simpler, as long as there is some consistency in the solution. For example, many problems have the same solution if they were in black and white instead of in color, and in theory you could use that to filter wrong solutions very easily, but we didn't have the time to get it to work, basically.

So other than the symmetry augmentations, there were no additional augmentations?

Symmetry, color shift, and shifting the examples: because the Transformer is not invariant to the order of the examples, we also shuffle the examples around a little bit.
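Here is a minimal sketch of the grid-level augmentations just mentioned (rotation, mirroring, color permutation, example shuffling). The exact set of transforms, how they are composed, and how the predicted output is mapped back by the inverse transform in the winning pipeline may differ; this only illustrates the idea.

```python
import random

# The task is unchanged as a task if we rotate/mirror every grid the same way,
# relabel colors consistently, or shuffle the order of the demonstration pairs.
# Predictions made on an augmented task must be mapped back with the inverse
# transform before submission (not shown here).

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def mirror(grid):
    """Mirror a grid left-right."""
    return [row[::-1] for row in grid]

def permute_colors(grid, perm):
    """Relabel colors; whether the background color is kept fixed is a design choice."""
    return [[perm[c] for c in row] for row in grid]

def augment_task(train_pairs, test_input, rotations=1, mirrored=False, perm=None, seed=0):
    perm = perm or {c: c for c in range(10)}
    def f(g):
        for _ in range(rotations):
            g = rotate90(g)
        if mirrored:
            g = mirror(g)
        return permute_colors(g, perm)
    pairs = [(f(i), f(o)) for i, o in train_pairs]
    random.Random(seed).shuffle(pairs)   # example-order augmentation
    return pairs, f(test_input)
```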
Very interesting. So then we get to, we're getting close to putting the pin in the middle of the dartboard here, which was the main thing that you guys did: you do some kind of a search and evaluation process as part of your test-time inference. Can you tell me about that?

There are actually different methods of sampling from an LLM. You could do greedy sampling, which means you always take the token with the highest probability, or you could sample in a stochastic way, by taking tokens according to the probability predicted by the network. All these sampling methods didn't work very well when we wanted to generate lots of different candidates. What we actually did was implement a selection process later on, which we can maybe talk about later, so the focus in this generation step was essentially to generate lots of candidates to increase the chance of finding the correct solution. We first started out with sampling, and we also tested beam search, because often there are different paths through the network which give different predictions, and one of them is correct. But in the end we settled on a depth-first search, which we implemented ourselves because we hadn't found it anywhere, and that worked quite well; it had quite some advantages over beam search. What we do is treat the tokens predicted by the network as a search tree, and we then search through this tree to find all solutions which have a sampling probability above a certain cutoff value. For example, if we set this probability to 10%, we get a variable number of candidates, which could be up to ten for a 10% cutoff, of course. This search has quite some advantages. The first is that it's very memory-efficient: you only need to store exactly one path, so it's as memory-efficient as a single inference step, and much more efficient than beam search, which needs essentially the same memory for each beam. The second advantage is that it gives us multiple solutions to the problem at once. And the third is that we have a cutoff, so if there are paths that are not promising, the search stops early and just doesn't even follow them, and that gave us a big efficiency boost in generating solution candidates, I would say.

This is so fascinating. I interviewed these guys at the University of Toronto and they were talking about the reachability space of language models, and they had this famous Roger Federer game, which was: you try to make the language model say "Roger Federer is the greatest", and you have to find what prompt will reach those words. Their theorem, essentially, is that the capacity, or the reachability, of Transformers is incredibly flexible. There's a remarkable divergence between the way that we sample tokens from a language model and what they're capable of. A clearer way of saying that is that there is a misalignment between how we sample language models and the epistemic truth, or the correctness, of the program that we are sampling, and I find that fascinating. It's also related to this creativity problem, which is that when we do greedy sampling, which means we just take the best token and the best token and the best token, we're missing out on all of these creative possibilities, and this is something that a lot of people aren't really thinking about yet.

I think that's a very interesting problem, and we have to distinguish between a language context and the ARC context. There's a substantial difference, namely that we only have ten possible tokens that might be the next correct answer, and we also know there is only one correct answer. In language you can have millions of ways of rephrasing the same idea, and all of these are valid or good continuations. In our case, which is one of the reasons why the DFS works so well, we only have one correct continuation and we have to find it. It's very easily evaluatable, and it's very easily searchable too, because the set of possible continuations is much, much smaller. We can even calculate how probable each solution is by simply taking the solution, doing a forward pass, calculating the logits, and summing them, and then we know how low the cutoff needs to be for our DFS to find the correct solution, guaranteed. That makes the problem very tractable in this sense.
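A compact sketch of the depth-first sampling idea described above: treat next-token probabilities as a tree and return every completion whose total sampling probability stays above a cutoff. The `next_token_logprobs(prefix)` interface and the end-of-sequence id are stand-ins for illustration; the actual implementation runs on cached transformer states rather than a simple recursive call per token.

```python
import math

# Depth-first search over the token tree: follow every branch whose cumulative
# sampling probability stays above `cutoff` and collect all completed solutions.
# `next_token_logprobs(prefix)` is a hypothetical stand-in for one forward pass
# of the model, returning {token_id: log_probability} over the reduced
# vocabulary; EOS marks the end of a predicted grid.

EOS = 0  # hypothetical end-of-sequence token id

def dfs_sample(next_token_logprobs, prefix, cutoff=0.10, logp=0.0, max_len=1024):
    """Return a list of (tokens, probability) completions with probability >= cutoff."""
    if prefix and prefix[-1] == EOS:
        return [(list(prefix), math.exp(logp))]
    if len(prefix) >= max_len:
        return []
    results = []
    for tok, lp in next_token_logprobs(prefix).items():
        if logp + lp >= math.log(cutoff):   # prune unpromising branches early
            results += dfs_sample(next_token_logprobs, prefix + [tok],
                                  cutoff, logp + lp, max_len)
    return results
```

Note that with a 10% cutoff there can be at most ten surviving completions, since the probability mass has to be shared between branches, which matches the bound discussed later in the conversation.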
And all the sampling algorithms are built for language, because LLMs work on language, obviously, which is also why we didn't find the DFS anywhere and had to implement it ourselves: it's not a valid way of sampling language. It just can't work there; language has too many possible paths, and each path, as long as it's valid, should have roughly the same probability. If you have a million possible continuations, each probability is one in a million, so it's very hard to use it in that context. But because our problem is discrete and small, we can sample very, very efficiently.

This raises so many interesting questions, because you could argue that the reason why these models are misaligned between program or knowledge correctness and how we sample them is, by your argument, that there are too many degrees of freedom in natural language. So it's almost like they're dealing with this ambiguity in their training, and therefore, when we sample from them, we're not very likely to get the correct answer.

You have to be careful about that, because we have to differentiate between the probability of a path and the probability of the answer. You can have a very improbable path that leads to a specific answer, and then you can have multiple variations of it, but the probability of the answer is not dependent on any single path; it is the sum over all paths that lead to this answer. So in language you have this broad set of paths that all lead to the same solution, and when you use standard sampling techniques, you sample the final answer correctly; that's the use of sampling. In the ARC contest there are no alternative intermediate steps, right, so you don't need to marginalize over all possible paths; you just need to get the correct path. So for normal LLMs this is completely fine, but I think in the age of reasoning models, when we get to o1 or o3, you also need the correct reasoning steps, the correct intermediate steps, to get to the right solution, especially if you use it for code generation or for answering complex mathematical questions. So I think in the future this will be a much larger problem, but up to now sampling was simply the correct approach.

A couple of things on that. Certainly, in your solution, the way that you traversed that sparse space, and the reason I say sparse is that it's a bit like in reinforcement learning: you need to take several steps without knowing what the value is going to be, and this is really difficult when we have some kind of traversal optimization algorithm that relies on a monotonic signal, where we're getting better and better and better. In this case the space is sparse, and we have to take several steps into the unknown. But the thing that really interests me is that, even if we don't know what the reasoning steps are, Ryan Greenblatt's solution really impressed and surprised me, because I assumed that this is just an exponentially large problem and that using a language model to guide the search would not be significantly better than just exhaustively enumerating programs. What he demonstrated quite succinctly is that it's tractable. Yes, we still need to generate loads of completions and do some search, but a reasonable amount of search gets you to the solution. And what I'm saying there is that even though there is an apparent orthogonality between
correctness and searching this space, the correct solution isn't that far away. That's fascinating.

Yeah, I think the interesting thing about Ryan Greenblatt's solution, from my perspective, is that he showed that code generation can work like that: the models can see the visual problems, as the 1D representation of the grids, and then produce code that works to move these 2D objects, even though they have never seen them, even without fine-tuning. On the other hand, code generation has a massive advantage compared to our approach, because you can just test your code: you can check whether the generated code performs the way you want it to perform, and if it does, then it's probably a good solution. Our approach doesn't have that; we had to find a way to select which candidate we want to submit without having any guarantee that it's right.

Well, yeah, this gets us onto the most delicious part of all. So, as you just articulated, Ryan Greenblatt's solution was technically a neurosymbolic architecture, so Subbarao Kambhampati would have been delighted, because he loves having formal guarantees, he loves actually knowing for sure. I mean, in this case we don't know for sure, because many programs that fit the examples are not actually correct, of course, but at least we can run a program in a Python interpreter and get a yes-or-no answer, even if it might be a false positive. Now, you guys have done something very interesting: you're using the language model to verify its own value as you traverse through this tree structure. Can you tell me about that?

Yeah, so our generation process essentially generates multiple solution candidates, usually up to 10 or 20 in practice, and now we need to find a way to select the correct one. What we actually did is use our model also for judging how good these solutions are. The interesting part is that it's difficult if you do it without augmentations, because a model of course favors its own predictions, since what we use as the score is the sampling probability we got in the first place. So what we essentially do at this point is use a lot of augmentation for the judging as well: we use 16 different augmentations of the problem and put them through the model to calculate a score for each augmentation, and then we aggregate them. And here is an interesting part: we tried different algorithms. One was, for example, to sum up the actual probabilities, and the other was to sum up the logarithmic probabilities, which is essentially equal to multiplying the probabilities, and the second one worked much better for selection. We really asked ourselves why this is, and our hypothesis is essentially that the correct solution looks mostly correct from every angle: no matter how you rotate it, it gets maybe not a really high score, but it doesn't get an extremely low score. The incorrect solutions might also look good from some perspectives, but usually there are one or two or even more perspectives where they get an extremely low score, like 0.01% in probability. So essentially these low scores are what matters in the calculation, and you can't get these scores during the generation process, because those continuations are never sampled; we actually need the extra scoring step to calculate these low scores and sort out the false solutions.

This is absolutely fascinating.
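The following is a hedged sketch of the selection step just described: score each candidate by summing log-probabilities of the augmented candidate under the augmented task across a set of augmentations, and keep the best one. `candidate_logprob` is a hypothetical teacher-forced scoring call, and `augmentations` is assumed to be a list of grid transforms like those sketched earlier; the winning solution's exact aggregation (16 views, averaging of log scores) may differ in detail.

```python
# A wrong answer often looks fine from most "angles" but collapses under one
# or two augmented views, so summing log-probabilities (i.e. multiplying
# probabilities) across views punishes those views hard. Averaging the log
# scores instead of summing would give the same ranking.

def select_candidate(candidates, task_pairs, test_input, augmentations, candidate_logprob):
    """Pick the candidate with the highest summed log-probability across augmented views.

    candidate_logprob(pairs, test_input, candidate) -> log P(candidate | prompt)
    is a hypothetical teacher-forced forward pass of the model.
    """
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = 0.0
        for aug in augmentations:  # aug maps a single grid to its augmented view
            aug_pairs = [(aug(i), aug(o)) for i, o in task_pairs]
            score += candidate_logprob(aug_pairs, aug(test_input), aug(cand))
        if score > best_score:
            best, best_score = cand, score
    return best
```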
So you generate 16 augmentations for, let's say, a test specification, and, as you said before, you had already fine-tuned the model on symmetry permutations, which means it has awareness of different symmetry transformations. And you're saying that using it to do self-reflection on the value of those symmetry augmentations gives it different perspectives on what good looks like, and then you can look at the coherence of the different symmetry perspectives to give you an idea of which one is correct.

Yeah, exactly. Ironically, it's very useful to us that the LLM does not have symmetry in its predictions. The interesting part is that it's useful to us that the language model does not have the 2D symmetries: if it had the symmetries, we couldn't use our augmentations for scoring, because the score would be the same every time. So in a way we abuse the fact that the LLM is not perfect at 2D tasks. Also, because it's always generating from left to right and from top to bottom, some parts of the puzzle get generated later, with more information, and some parts get generated earlier, with less information, and if we rotate it, we generate other parts of the solution first. So a lot of the score differences between the different augmentations come from the fact that it can generate the complex parts of the solution earlier or later. It helps that sometimes a solution depends on previous input, and if we generate a wrong solution and then rotate it, the model sees the problem much earlier: it can see, ah, it makes no sense that the line that goes from left to right here ends at an empty pixel, that should never happen. Because we rotate the problem, it can see it from a new perspective and then decide that it's an invalid, or very improbable, solution.

How much does this depend on the symmetry augmentation before test time? So do you, before we even get to any test-time computation, take all of the evaluation tasks, do some symmetry transformations, and fine-tune on those? If you don't do that, how does it affect this evaluation?

I think we have never tested it. You mean not doing the augmentations during training?

Yes.

Yeah, we have never tried that out; we always did augmentation from the beginning.

So I'm a bit confused, because you seem to be saying it's actually a good thing that it doesn't know about the augmentations, so doing the augmentations seems to me like it would be a good thing, but maybe it isn't.

It's a trade-off. We would lose a lot of training data, and it is good if it can just generate the correct solution from every perspective, right? But some problems are simply easier from some perspectives than others, and in theory the network should converge to a correct solution from every perspective, if it's deep enough and strong enough, but sometimes it just can't, and then we can use that fact by generating different augmentations and scoring them accordingly.

I'm trying to understand as well, because there are different perspectives on language models. Subbarao Kambhampati called it, well, I think "jagged intelligence" is what Andrej Karpathy called it, and that's this idea that pointwise they're actually very good, which means they have these islands of knowledge. So you can say, I'm going to do a symmetry transformation, and it knows; given an example of
something, it knows if it's good or bad, but it doesn't know how to get there. So there seem to be these concentric circles where they are better at discriminating than generating, and they're also capable of knowing when they don't know how to get there.

I think in some ways yes; it depends so strongly on the task, like always, right? There are problems where it's true that they have a much easier time discriminating solutions than generating them, but at the same time it depends how you measure it. We measure the score of a task, or of a solution, by calculating basically the probability that it would be generated by normal sampling, so in our case generation probability and evaluation are exactly the same. The only way we can then use the discriminatory abilities of the LLMs is by using these transformations, because, like Daniel said, LLMs prefer their own outputs, so we have to trick the LLM into thinking that it's an output it would never have produced itself.

There's also this thing that Thomas Dietterich was telling me: it is very interesting to look at the divergence between epistemic risk and aleatoric risk, and that's simply that there's a divergence between things that are true and things that the language model will give you if you stochastically sample from it. And he said that before RLHF there is much more of a correspondence between the likelihood of a trajectory and epistemic factuality, whether the thing is actually true or not, and RLHF deranges that. But I'm thinking that doesn't really affect your situation, because you're doing fine-tuning on top, so what you're doing is either overwriting or orthogonal to any RLHF training. You're actually deliberately saying, here is what good looks like, and if you can generalize from that example of goodness, you can use the likelihood of a trajectory as a proxy for correctness.

We actually used an uncensored Llama model in our experiments, and we also compared it to the normal model, and the uncensored model was a little bit better, so maybe that makes a little difference.

Yeah. Was it retrospectively uncensored, or was it just never RLHF-trained?

Retrospectively; it was derived from the normal one, yeah.

Yeah, because I've played with uncensored models, like the Qwen models, and I found that they are lobotomized, and for whatever reason, however they are uncensored, and I think there were various different approaches to uncensoring them, they just seem to degrade to the performance of a model half the size. So a 72-billion model will now have the reasoning abilities of a 32-billion model. But that's just my experience.

We're actually not sure whether it's at all important that we use a pre-trained model. One of the experiments in our pipeline at the moment is just resetting all the weights and then training, to check whether we even use the pre-trained language capabilities or whether it's just the architecture.

Actually, that's very interesting. Yeah, I interviewed Randall Balestriero at NeurIPS and he made the same discovery. He's got this really interesting paper out, and, you know, we are taught that we need to have these large pre-trained models and that we're leveraging the knowledge inside the model and so on, and apparently no one had really done this experiment: that you can just, and this is for discrimination, not generation, right, so if you're just doing some kind of
classification problem, you can take a very large model, train it from scratch for some discriminative task, and actually get better performance than frontier models, pretty much with hardly any training and hardly any data. Isn't that something? And it almost goes back to those papers from years ago: there are so many inductive biases just in the architecture of the model itself that you don't actually need that much training.

I think one of the interesting parts is that we re-lobotomize our LLM too. At the end we have an LLM that has basically only 140 tokens or something; we remove all language capabilities from the model, just delete them, and really make sure that it can only argue in the space of ARC: it can only output the numbers 0 to 9, end-of-line tokens, and some letters. We did that mostly because of computation requirements, because it saves a surprising amount of RAM, but yeah, we remove all language capabilities. So we're often asked, do we do any pre-prompting, is it chain-of-thought thinking? It can't; it's all gone.

Well, that's really interesting, because certainly the MindsAI team, I was talking with Jack, and they're very bullish on this idea of multimodal or cross-domain transfer. I think Jack was saying, for example, they might even give English-language descriptions of the ARC challenges, and they're looking for as many ways as possible to leverage the base model in the language model. And there are two views on this. One view is that these things are a general intelligence, and the whole point of o3, for example, is that it's creating a novel skill program by doing some kind of compositional generalization and working in this novel situation. The alternative view is very much that they are a jagged form of intelligence, a very specialized form of intelligence, and they're mostly just working on specific instances of very similar things that they've seen before, indirectly.

I think we could say that we found the model performed much better at tasks where it had seen similar tasks. In our paper we claim a performance of, I don't know, 76, well, 72% on a left-out part of the evaluation data set, but the performance dropped noticeably if we never trained on any evaluation-set tasks, and we believe that to be because there's some conceptual leakage, as we called it, from some tasks of the evaluation data set to other tasks. They're not the same, but they have similar ideas, which is probably also a reason why the additional data sets like ConceptARC or BARC helped: they maybe introduced some novel ideas in some way. And if the LLM has never seen an idea conceptually in training, then it's very unlikely that it will perform well on it. So, for example, if we train the LLM on ARC challenges that are, I don't know, flood-fill algorithms or something like that, then it will not transfer those skills to a counting task, but as long as it has seen some counting tasks, it might be able to combine the ideas to solve novel challenges.

It is very interesting. I mean, just take multiplication, for example. Frontier models don't generalize very well; they do a little bit, three digits, maybe four digits or something like that, but you can take GPT-2 and you can fine-tune it up to 20-digit
multiplication, and it works better than any frontier model. So there's always this notion that fine-tuning for a specific task is better. Architecturally that doesn't work very well, right, because we're in this world and we're subject to novelty all the time, and we can't possibly be fine-tuning for every possible situation. So maybe that's where this active fine-tuning, this test-time computation, comes into play, because increasingly we'll build architectures where we are learning online and maybe sharing that knowledge in a clever way with other models. It kind of feels that with this new online version of prediction we can have our cake and eat it.

The interesting thing is, for multiplication, tokenization plays a big role, because the models tokenize numbers differently. The model we used had just one token for some digits, but sometimes it combined digits into one token, so essentially there were the digits 0 to 9 it could tokenize, but then there were some two-digit numbers and some three-digit numbers as single tokens. I think newer models do this differently, or probably have all three-digit numbers integrated, but for us this would have been a problem in ARC, so we also removed all of this from the tokenizer; we just have single digits, because it would have been a big problem, I guess, if the length of the outputs changed.

Oh, interesting. So you modified the tokenizer to remove all of the compounded numbers.

We essentially removed anything except the numbers, end-of-line, end-of-problem, and the input and output prefixes (a small sketch of this kind of vocabulary restriction follows this exchange).

Very interesting. And of course you weren't deranging it, because you were doing the training on top, so you're doing a higher-resolution-token version of training.

Exactly, yeah. And I think, for the multiplication task too, a common problem I see on Twitter is that people misunderstand how complicated a problem is for LLMs. I think OpenAI has almost every three-digit number as a token, so if you tell ChatGPT to multiply two three-digit numbers, it has to basically remember a million multiplication pairs; it's a very, very complicated task, and then people say, yeah, it can't multiply three-digit numbers, but it's a hard task for ChatGPT. The same goes for the typical strawberry task. My way of explaining it is: imagine every syllable, or every letter, is a token in a language you've never seen, and you have to remember how many Rs are in each of these tokens, and you have to remember that for 130,000 different things. It's a complex task, and then you have to learn it without ever seeing the letter R. That's hard. LLMs are very, very strong problem solvers if you fine-tune them, but they are not the intended solution to ARC in the way that we use them, so in a way it's a hacky solution. But because of the challenge, you have to use the best-performing model. You could do a very creative solution that goes further in the direction of AGI, right, but you wouldn't win; you win with LLMs and abusing fine-tuning speed and things like that. So LLMs are too strong for this contest to enable any other solution, basically.
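As a rough illustration of the vocabulary restriction mentioned above: the sketch below only computes the set of token ids worth keeping (digits, newline, a few structural markers) so that all other logits could be masked out. The actual solution went further and rebuilt the embedding and output matrices around the reduced vocabulary to save memory; the model name and the specific marker symbols here are illustrative assumptions.

```python
from transformers import AutoTokenizer

# Compute the allowed token ids for an ARC-only vocabulary. The model name is
# an assumption for illustration; the kept marker strings ("I", "O") mirror the
# hypothetical serialization sketch earlier, not the exact tokens used by the
# winning solution.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

keep = [str(d) for d in range(10)] + ["\n", "I", "O"]

allowed_ids = set()
for symbol in keep:
    ids = tokenizer.encode(symbol, add_special_tokens=False)
    if len(ids) == 1:              # keep only symbols that map to a single token
        allowed_ids.add(ids[0])
allowed_ids.add(tokenizer.eos_token_id)

# During generation or scoring, logits for every id outside `allowed_ids`
# could be set to -inf so the model can only "speak ARC".
```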
Interesting. What about the fact, and this is quite an interesting coincidence, that ARC was designed to be solvable by humans, and one of those restrictions is that the grids shouldn't be more than about 30 by 30 cells, because otherwise humans would start making loads of mistakes; but that also, almost coincidentally, happens to be the upper bound of what we can correctly do with an LLM?

If it was 50 by 50, or even 60 by 60, no one would be using LLMs to solve ARC; we would have only solved the small problems.

Yeah, because even now, when we look at the failure modes of o3, you can definitely see it tailing off with solution size: it starts to make more and more mistakes when it gets to 30 by 30. But it does raise the question of the complexity of the problem. Again, my intuition is that this problem has a kind of exponential complexity to it, and I would have said the same thing about language. It's interesting: if you think about all of the ways that references can hierarchically connect to each other, it seems like the information complexity of even a book should be exponentially large, because there are all of these connections between the paragraphs and the sentences and the words and so on, and yet language models don't have any problem understanding and explaining books, and I want to understand what that is. So there's the potential complexity of all of the possible connections between the terms, and then there's the actual embedded complexity of language, and I think you were making this argument earlier: you feel that the embedded complexity doesn't scale exponentially with the problem size.

Yeah. In many challenges, at least, you can see that many problems are mostly trivial; there are just a few pixels where you have to make the right decisions. We even measured that. In an intermediate stage of our solution, where we still did normal sampling, where we didn't use DFS, we found that we had to get the model to choose the second-highest-probability token three times, and then we would have solved 80% of the challenges; but we had to get it to choose, and we don't know at which of the 900 token positions the second-best choice is the correct one (a small sketch of this kind of measurement appears below). The space of solutions the LLM generates is much, much smaller than the theoretically large space, because a lot of the pixels are trivial. The background is often trivial; many problems are "move this object here", and if you start the object at the right position, then the LLM generates the rest of the object completely correctly, but it has to decide where to put the object, and that's the hard part. The number of decisions we have to make to generate the correct solutions is surprisingly low.

So does that imply that there's no reason in principle why an LLM wouldn't be able to scale to 100 by 100, or even 1,000 by 1,000?

I think it's one of the reasons why Daniel's DFS solution worked so well for us. When you do normal sampling, each token has a probability of being selected wrongly; there are many ways of mitigating that, like min-p sampling or top-k sampling, but for every token we generate we have a small chance of getting it wrong. Our DFS solution just generates all solutions that are above a certain probability, so we are basically guaranteed to get the most likely solutions, and some below that, which removes a lot of the exponential problems of sampling, as long as our probability bound is low enough.
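To make the "only a few hard decisions" observation concrete, here is a hedged sketch of how one could count the positions where greedy decoding would go wrong on a known solution. It reuses the hypothetical `next_token_logprobs(prefix)` interface from the DFS sketch and is an illustration of the measurement described above, not the authors' actual measurement code.

```python
# Walk the correct token sequence under teacher forcing and count the
# positions where the correct token is not the model's top choice, i.e. where
# greedy sampling would fail and the sampler would have to pick a lower-ranked
# token.

def count_hard_decisions(next_token_logprobs, prompt_tokens, solution_tokens):
    hard = 0
    prefix = list(prompt_tokens)
    for correct in solution_tokens:
        logprobs = next_token_logprobs(prefix)          # {token_id: log-prob}
        if max(logprobs, key=logprobs.get) != correct:  # greedy pick is wrong here
            hard += 1
        prefix.append(correct)                          # continue from the truth
    return hard
```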
How did you decide on the depth of the search and the threshold?

We decided on the depth by essentially just testing it out: how deep can we go and still find, in most cases, the correct solution? It's of course a trade-off, because if you search down to 1% you can in theory get 100 solutions, though in practice you get far fewer, and if you search to 10% you can only get 10 solutions, and again you get fewer. In some problems where the solution is clear it doesn't make a difference; you just find the solution. But when the solution is unclear, you get a lot of samples when going down to 1%, and that would have been computationally infeasible on Kaggle. So in the end it was a trade-off that we just tested out, and we ended up with different values between 10 and 17% that we used in practice, and the solution was mostly in there. Going down from 10% to 1% gave an improvement, I think, from 70 to 71% on the score, so not much.

And still, technically, a variable computation budget, because if you think about it, at 1%, when you do it 100 times, you could get a huge variance in the amount of searching that you're doing, but it's kind of vaguely budgeted. Is that correct?

Yes, it's not clear beforehand how many solutions you will get from the DFS; it can be more or fewer depending on the problem. And something that was very interesting: at the end of the contest we switched back to a larger model, because we had fine-tuned a lot, and when you do normal sampling, switching to the larger model multiplies the compute you need by a certain factor, because the inference is essentially just n times a fixed amount of compute. But with the DFS, because the larger model got better at deciding what the correct solution is, it could actually prune earlier in the search, so it still took a little longer, but not as much longer as we expected from our previous tests. That was quite surprising, and it also enabled us to do more in the short compute time on Kaggle.

Did you guys, just for fun, at home, try a 72-billion model or something bigger?

I tried it on the evaluation set after the contest, but I did just one pass. I don't think it was a 72-billion model, but a 32-billion model or something in that range, and I didn't have time to fine-tune it properly, so I ran it with the same parameters and it didn't make much of a difference in that setting. Of course that's not final, because I hardly had time to tune it.

So, yeah, there seems to be an interesting relationship between the strength of the base model and how much test-time computation you do, which is to say there's an argument that, if the budget allows, it's almost better to do more test-time computation on a smaller model.

Yeah, I think so, yes. Going up in size is not necessarily always good in this case. The more parameters you have, the more parameters you have to fine-tune, and for a fixed compute budget I think the smaller models are much better: you can generate more, you can evaluate more, you can filter more. But also, in some experiments the larger models just didn't perform better; they converged to roughly the same score.

So then there's this notion of adaptability. Let's imagine that you were deploying this thing; let's say you wanted to make some money and you deployed a web service called ARC Solver. I'm guessing what you would notice over time is that if you persisted the LLM and you were continuously fine-tuning it and adapting it and so on, you would probably notice its improved knowledge by
virtue of the fact that it was doing less tree searching, because it found the solution more quickly. Do you think that would just be the case, that it would just get generally better at ARC and know the solution more quickly, or do you think you would get weird things happening, where it might learn distractors on some tasks, get better at some and worse at others?

It's a good question.

I mean, part of the reason I'm asking is that I think that's what's happening with the o1 models. I think the reason why we're seeing a dramatic improvement even in three months is because they are learning from the users, so they're continuously adapting and augmenting their knowledge, and the models are getting better more quickly, needing to do less thinking over time.

Yeah, maybe. I have an interesting fact to share: for the fine-tuning we tried different numbers of tasks that we trained on at the same time in the second fine-tuning. We did one with just a single task, one with 50 tasks on Kaggle, and we also tried the full 400 tasks of the evaluation set, and that led to very much degraded performance. Up to 50 was fine, but for more tasks it just didn't work; maybe it was too much for the model to store at the same time, I'm not sure. We also used LoRA, which has a limited number of parameters, so if we did some online training and started with new examples, maybe it would have to forget the old ones to solve the new ones. That is quite possible.

Yeah. You know, we're always told in the theory of neural networks that there's a problem with catastrophic forgetting and continual learning and all the rest of it, and even Chollet wrote in his book that when you do fine-tuning you have to be really careful: you have to turn the learning rate right down, because you'll just destroy anything the model already knows and it will only learn the thing that you're teaching it, and so on. And we're becoming quite red-pilled on OpenAI now, because we believe that these things are magical memory machines, just sucking in all of this knowledge, and the more knowledge you give them the better they get, and that seems to fly in the face of practical experience.

Transformers are just very, very good at learning facts; I think it's frankly incredible what they can store. But there's also a slightly different thing that I find interesting. In our fine-tuning we use LoRAs, obviously, because fine-tuning the whole network is very expensive, and there's a small, subtle thing with LoRAs: if you use weight decay on a LoRA, it degrades the weights of the LoRA, but you always have the base network below it, so you can't fall below a certain performance; you're always around the correct full network and just move a little bit in another direction. So if you do test-time training or fine-tuning with a LoRA, you in a way have the safety blanket of your base model underneath it, which can be very useful. In our challenge we merged, I think, the first LoRA from the pre-training into the weights and then did the test-time training on another LoRA, so in the worst case it would degrade back to the base model, the base model from pre-training.

And just for the audience, can you quickly, because LoRA is low-order rank approximation, if I remember correctly, can you explain how it works? Is it approximation or adaptation?

Adaptation.

Adaptation, yeah. It's a very simple, very elegant idea. You have all these square weight matrices in the attention layers, right, and instead of fine-tuning all the weights in them, you fine-tune two smaller matrices that you can multiply to get the full-rank matrix, and that's mostly enough. If you have a 1,000 by 1,000 matrix, instead of fine-tuning 1 million parameters, you take one 1,000-by-1 and one 1-by-1,000 matrix, that would be rank one, and fine-tune those two small vectors. Because you go down to a smaller, simpler problem that you want to learn, it's enough to get the network to go in the right direction, basically.
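Here is a minimal from-scratch sketch of the low-rank adaptation idea just described (freeze the base weight matrix and learn only a rank-r update BA). It is a toy illustration for clarity; the winning solution used standard LoRA tooling on the model's attention weights rather than a hand-rolled layer like this.

```python
import torch
import torch.nn as nn

# Toy LoRA layer: keep the original linear layer frozen and learn only a
# low-rank update B @ A, so the adapted layer computes W x + scale * (B A) x.

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 128, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # frozen base weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # B starts at zero, so at initialization the layer equals the base model.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrapping a 1,000 x 1,000 layer with rank 128 trains roughly 256k parameters
# instead of 1 million; weight decay on A and B pulls the update toward zero,
# i.e. back toward the frozen base model, which is the "safety blanket" effect
# described above.
```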
I think we had a surprisingly high rank, right? We tried different ranks, from, I think, 64 to 256, and the original model, I think, has a rank of 4,096, and anything above 128 didn't actually make a difference to the final performance. We also tried full fine-tuning once; it was very memory-consuming, and it also didn't make a difference for us. So this low rank is probably enough for the problem.

And just to give folks an idea of how long it took, are we talking half an hour, two hours, a day?

To compete in the challenge, as in...?

Let's say at the beginning, when you were fine-tuning the model, how long did it take?

Oh, at the beginning we had smaller models and also shorter training times, especially while we were still working on the original training set, which we couldn't put in too often without overfitting, so then it was a few hours, maybe two to four hours. And it got longer and longer during the challenge. At some point we were at two days on an Nvidia H100, and the final model was essentially eight days on an Nvidia H100, so that was quite long. And the interesting thing is, when the BARC-Heavy data set appeared, when we found it, there were only, I think, four days left until the end of the contest. So what we did was start multi-GPU training to do the work of eight days in four days, in two days. Luckily we got a lot of compute: Lambda Labs offered us a machine with eight H100 GPUs, and so we could still train the model in time and submit it on the last day. That was the model with the 56.5 score which, sadly, didn't finish in time.

Yeah, they say necessity is the mother of invention. I'm one of these guys where, you know, what you're supposed to do when you're coding is add one feature at a time, check it, and then commit it, and so on, and I have this kind of, not ADHD, not quite, but I've got so many cool things I want to do and I want to just stick them all in there. Where do you fall on that spectrum?

Normally I would fall similar to you, but in the setting of the contest we just didn't have time. It didn't matter; we had to push new solutions, we had to experiment quickly, and trying to write clean code or refactor in a way that makes it more flexible would have taken, I don't know, one or two days, and that was too long. We had so many ideas we wanted to experiment on and try that we had to use the code base that we had. And when we put in new features, we would essentially have to refactor it again, because we didn't know exactly what things we would try out during that time, so it was probably a little bit pointless to refactor it all the time during the contest.

It felt a little bit like a mad dash.

There were some
refactorings; when it got too complicated, there were two or three steps where I completely refactored it in between, during the challenge, but between these refactorings we just used it over and over and just added stuff, essentially.

And would you say your intuitions were fairly correct throughout? Because we get a bit superstitious about things, we think, oh, that thing's really important, we need to keep that in. Were there any situations where you realized, to your horror, later on that something you thought was good was actually not good?

We had that sometimes. We were constantly testing different things, leaving some stuff out and putting some stuff in, and we also used a lot of our submissions for testing: what happens if we remove this feature, what happens if we remove that feature? Often our favorite ideas that we tried didn't pan out, and then we had to decide: do we invest more time in making them work, or do we just say, screw it, we don't have the time to investigate it deeper, we have to push the things that work? And often it was the right decision to ignore our crazy, fun ideas and just do the thing that works better.

And what were the top two things, out of all of this, that you think moved the needle the most?

The things that were most important, I guess, in our implementation were, first of all, the scoring process, scoring with the augmentations, which allowed us to select the solutions reliably. That was one thing that gave a big jump; I think it was from 30 to 37 points when we implemented it. And the other thing, I think, is the depth-first search algorithm, because it gave us such a performance gain that we could change a lot of things afterwards, like using a bigger model and doing more augmentations, that we couldn't have done without it.

Well, that's interesting, because part of experimentation is actually having a useful signal, and you're saying that certain features actually unlocked a signal on other things you could try.

Yes, the DFS was a very natural extension of the scoring, I think. After we had a good, working scoring and selection algorithm, the challenge became to generate good candidates, and we tried a lot of things to do that, some LLM things and some external things, like combining snippets of different solutions to generate new solutions, but what really worked well was Daniel's DFS algorithm. It was very quick and gave us so many different candidates we could score, and very few bad candidates, so the two approaches worked very well together; it was a very synergistic approach.

And did you ever see any catastrophic failure modes with the DFS, where it would just search for an unreasonable amount of time before hitting the threshold?

No, not really, because we limited the search depth to 10%. It did happen that it searched quite a lot of solutions sometimes, but we set the limit in such a way that it didn't lead to a big problem, because the probability mass has to be distributed over all paths, right? So if the lower limit is 10%, then there can't be more than ten paths that are valid at the same time.

Do you think it would be possible, before you do the depth-first search, to predict what the computational budget is, or to predict what the search tree might look like?

Oh, that's an interesting question. I think it might be quite difficult, because you don't know until you have actually done some searching, so you could
probably predict it, as an estimate, after you've searched part of it. Or you could go for iterative deepening: you first go to 10%, then to 5%, and so on, if you want to search deeper. That would also be a possibility if you want to limit the compute.

I'm just thinking: you could take the, you prompt the model with that solution, and then you could just use a classifier, and train the classifier on different topologies of trees that you found previously, and you might be able to predict, a priori, how much is this going to take me, how much computation should I invest in this problem?

If we have the solution, then that works. You can just check the logits for each pixel and then measure, basically, the entropy, how distributed the probability is, and if the probability is distributed enough, then that is a branch in the tree. But then you only have the one path through the tree and the branches from it; predicting the whole tree is probably basically equivalent to predicting the correct solution, in some ways. You have to know the solution to know what the tree looks like.

We didn't touch on this before: so you were doing a minimum-entropy search, but what would happen