Whenever a benchmark that was thought to be extremely difficult is (nearly) solved, it's a mix of two causes. One is that progress on AI capabilities was faster than we expected, and the other is that there was an approach that made the task easier than we expected. I feel like there's a lot of the former here, but the compute cost per task (thousands of dollars to solve one little color grid puzzle??) suggests to me that there's some amount of the latter. Chollet also mentions ARC-AGI-2 might be more resistant to this approach.
Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.
> the other is that there was an approach that made the task easier than we expected.
from reading Dennett's philosophy, I'm convinced that that's how human intelligence works - for each task where we say "only a human could do that", there's a trick that makes it easier than it seems. We are bags of tricks.
We are trick generators, that is what it means to be a general intelligence. Adding another trick in the bag doesn't make you a general intelligence, being able to discover and add new tricks yourself makes you a general intelligence.
Not the parent, but from my reading of Dennett, he was referring to the tricks that we got through evolution, rather than ones we invented ourselves. As particular examples, we have neural functional areas for capabilities like facial recognition and spatial reasoning, which seem to rely on dedicated "wetware" somewhat distinct from other parts of the brain.
But humans being able to develop new tricks is core to their intelligence; saying it's just a bag of tricks means you don't understand what AGI is. So either the poster misunderstood Dennett, or Dennett wasn't talking about AGI, or Dennett didn't understand this well.
Of course there are many tricks you will need special training for, like many of the skills humans share with animals, but the ability to construct useful, shareable, large knowledge bases based on observations is unique to humans and isn't just a "trick".
generating tricks is itself a trick that relies on an enormous bag of tricks we inherited through evolution by the process of natural selection.
the new tricks don't just pop into our heads, even though it seems that way. nobody ever woke up and devised a new trick in a completely new field without spending years learning about that field or something adjacent to it. even the genuinely new ideas tend to be an old idea from a different field applied to a new one. tricks stand on the shoulders of giants.
Or the test wasn't testing anything meaningful, which IMO is what happened here. I think ARC was basically looking at the distribution of what AI is capable of, picked an area that it was bad at and no one had cared enough to go solve, and put together a benchmark. And then we got good at it because someone cared and we had a measurement. Which is essentially the goal of ARC.
But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proof point that AI can solve simple problems presented in intentionally opaque ways.
I'd agree with you if there hadn't been very deliberate work towards solving ARC for years, and if the conceit of the benchmark weren't specifically based on a conception of human intuition as, put simply, learning and applying out-of-distribution rules on the fly. ARC wasn't some arbitrary inverse set; it was designed to benchmark a fundamental capability of general intelligence.