> Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned”
Something I missed until I scrolled back to the top and reread the page was this
> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set
So yeah, the results were specifically from a version of o3 trained on the public training set
Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.
On the other hand though, I don't think the o1 models nor Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.
"Raising visibility on this note we added to address ARC "tuned" confusion:
> OpenAI shared they trained the o3 we tested on 75% of the Public Training set.
This is the explicit purpose of the training set. It is designed to expose a system to the core knowledge priors needed to beat the much harder eval set.
The idea is each training task shows you an isolated single prior. And the eval set requires you to recombine and abstract from those priors on the fly. Broadly, the eval tasks require utilizing 3-5 priors.
Great catch. Super disappointing that AI companies continue to do things like this. It’s a great result either way but predictably the excitement is focused on the jump from o1, which is now in question.
To me it's very frustrating because such little caveats make benchmarks less reliable. Implicitly, benchmarks are no different from tests in that someone/something who scores high on a benchmark/test should be able to generalize that knowledge out into the real world.
While that is true with humans taking tests, it's not really true with AIs evaluating on benchmarks.
SWE-bench is a great example. Claude Sonnet can get something like a 50% on verified, whereas I think I might be able to score a 20-25%? So, Claude is a better programmer than me.
Except that isn't really true. Claude can still make a lot of clumsy mistakes. I wouldn't even say these are junior engineer mistakes. I've used it for creative programming tasks and have found one example where it tried to use a library written for d3js for a p5js programming example. The confusion is kind of understandable, but it's also a really dumb mistake.
Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.
Or maybe benchmarks are just bad at measuring intelligence in general.
Regardless, every time a model beats a benchmark I'm annoyed by the fact that I have no clue whatsoever how much this actually translates into real world performance. Did OpenAI/Anthropic/Google actually create something that will automate wide swathes of the software engineering profession? Or did they create the world's most knowledgeable junior engineer?
> Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.
My understanding is that it works by checking if the proposed solution passes test-cases included in the original (human) PR. This seems to present some problems too, because there are surely ways to write code that passes the tests but would fail human review for one reason or another. It would be interesting to not only see the pass rate but also the rate at which the proposed solutions are preferred to the original ones (preferably evaluated by a human but even an LLM comparing the two solutions would be interesting).
If I recall correctly the authors of the benchmark did mention on Twitter that for certain issues models will submit an answer that technically passes the test but is kind of questionable, so yeah, good point.
Something I missed until I scrolled back to the top and reread the page was this
> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set
So yeah, the results were specifically from a version of o3 trained on the public training set
Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.
On the other hand though, I don't think the o1 models nor Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.