Those are supposed to be issues? After reading your list my impression of ARC-AGI has gone up rather than down. All of those things seem like the right way to go about this.
No, those aren't issues. But it's good to know what those numbers actually mean. For example, 25% is about the average human level (on this category of problems), while 100% is either top-human level, superhuman level, or the information-theoretically optimal level.
Sure, but aim for the stars and you'll hit the moon, right?
Like, fundamentally, who cares? For the purpose of an AGI benchmark, I'd argue you'd rather err on the side of mistaking something more intelligent for less intelligent than vice versa.
Yeah I'm quite surprised as to how all of those are supposed to be considered problems. They all make sense to me if we're trying to judge whether these tools are AGI, no?
I think that any logic-based test that your average human can "fail" (i.e., score below 50% on) is not exactly testing for whether something is AGI or not. Though I suppose it depends on your definition of AGI (and whether all humans, or at least your average human, count as AGI under that definition).
If I had a puzzle I really needed solved, then I would not ask a rando on the street, I would ask someone I know is really good at puzzles.
My point is: For AGI to be useful, it really should be able to perform at the top 10% or better level for as many professions as possible (ideally all of them).
An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.
> An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.
Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.
It seems they don't test for that, since they use the second-best human solution as a baseline.
And that's the right way to go. In the years before computers became superhuman at chess, few people cared that they could already beat random people. They cared when Kasparov was dethroned.
Remember, the point here is marketing as well as science. And the results speak for themselves. After all, you remember Deep Blue, and not the many runners-up that tried. The only reason you remember is because it beat Kasparov.
> The only reason you remember is because it beat Kasparov
There is an additional fascinating aspect to these matches, in that Kasparov obviously knew he was facing a computer, and decided to play a number of sub-optimal openings because he hoped they might confound the computer's opening book.
It's not at all clear Deep Blue would have eked out the rematch victory had Kasparov respected it as an opponent, in the way he did various human grandmasters at the time.
This is supposed to test for AGI, not ASI. ARC-AGI (later labelled "1") was supposed to detect AGI with a test that is easy for humans, not top humans.
> Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.
Humans without a clinically recognized mental disability are generally capable of some kind of skilled labor. The "general" part of intelligence is independent of, but sufficient for, any such special application.
The issue here is that people have different definitions of AGI. Going by the description, getting 100% on this benchmark would be more than AGI and would qualify as ASI (Artificial Super Intelligence), not just AGI.
If you only outdo humans 50% of the time, you're never going to get consensus on whether you've qualified. Whereas outdoing 90% of humans on 90% of the most difficult tasks we can come up with is going to be difficult to argue against.
This benchmark is only one such task. After this one there's still the rest of that 90% to go.
Beating humans isn't anywhere near sufficient to qualify as ASI. That's an entirely different league with criteria that are even more vague.
Even dumb humans are considered to have general intelligence. If the bar is having to outdo the median human, then 50% of humans don't have general intelligence.
Not true. We don't have a good definition for intelligence - it's very much an "I'll know it when I see it" sort of thing.
Frontier models are reliably providing high undergraduate to low graduate level customized explanations of highly technical topics at this point. Yet I regularly catch them making errors that a human never would and which betray a fatal lack of any sort of mental model. What are we supposed to make of that?
It's an exceedingly weird situation we find ourselves in. These models can provide useful assistance to literal mathematicians yet simultaneously show clear evidence of lacking some sort of reasoning the details of which I find difficult to articulate. They also can't learn on the job whatsoever. Is that intelligence? Probably. But is it general? I don't think so, at least not in the sense that "AGI" implies to me.
Once humanity runs out of examples that reliably trip them up, I'll agree that they're "general" to the same extent that humans are, regardless of whether we've figured out the secrets behind things such as cohesive world models, self-awareness, active learning during operation, and theory of mind.
> Yet I regularly catch them making errors that a human never would
I have yet to see an "error" that modern frontier models make that I could not imagine a human making. Average humans are way more error-prone than the kind of person who posts here thinks, because the social sorting effects of intelligence are so strong that you almost never actually interact with people more than half a standard deviation away. (The one exception is errors in spatial reasoning about things humans are intimately familiar with - for example, clothing - because LLMs live in literary space, not physical space, and only know about these things secondhand.)
> and which betray a fatal lack of any sort of mental model.
This has not been a remotely credible claim for at least the past six months, and it seemed obviously untrue for probably a year before that. They clearly do have a mental model of things; it's just not one that maps cleanly to the model of a human who lives in 3D space. In fact, their model of how humans interact is so good that you forget you're talking to something that has to infer, rather than intuit, how the physical world works, and then attribute the failures of that model to not having one.
> you almost never actually interact with people more than a half standard deviation away
I wasn't talking about the average person there but rather those who could also craft the high undergrad to low grad level explanations I referred to.
> This has not been a remotely credible claim for at least the past six months
Well, it's happened to me within the past six months (actually within the past month), so I don't know what you want from me. I wasn't claiming that they never exhibit evidence of a mental model (can't prove a negative anyhow). There are cases where they have rendered a detailed explanation to me, yet it contained errors that you simply could not make if you had a working mental model of the subject at the level of the explanation provided (IMO, obviously). Imagine a toddler reciting a quantum mechanics textbook at you but then uttering something completely absurd that reveals an inherent lack of understanding; not a minor slip-up but a fundamental lack of comprehension. Like I said, it's really weird, and I'm not sure what to make of it nor how to properly articulate the details.
I'm aware it's not a rigorous claim. I have no idea how you'd go about characterizing the phenomenon.
I think you are getting caught up on the intelligence part. That's the easy part, since AGI doesn't have to be intelligent; it just has to emulate intelligence. If you look at early chess AIs, you will see that they were very weak compared to even a beginner human. The level of intelligence does not matter for a chess bot to be considered AI. It is that it emulates intelligence that makes it AI.
>But is it general? I don't think so
I would consider it general, since I can hand it any problem I can think of and the AI will make an attempt to solve it. Actually solving it is not a requirement for AGI; being able to solve it just makes it smarter than an AGI that can't. You can trip up a chess AI, but that doesn't stop it from being AI. So why apply that standard to AGI?
How am I getting caught up on it? I acknowledged that I think frontier models qualify as intelligent but disputed the "general" part. In fact for quite a few years now there have been many non-frontier models that I also consider intelligent within a very narrow domain.
I think Stockfish reasonably qualifies as superhuman AI but not even remotely "general". Similarly AlphaFold.
> Actually solving it is not a requirement for AGI.
I think I see what you're trying to get at but taken as worded that can't possibly be right. Otherwise a dumb-as-a-brick automaton that made an "attempt" to tackle whatever you put in front of it would qualify as AGI.
It’s not that simple, since each problem is supposed to be distinct and different enough that no single program can properly solve more than one of them. No problem spec is provided either, iiuc, so you can’t simply ask an LLM to generate code without doing other things.
A human can sit down to play a game with unknown rules and write a spec as he goes. If a model can't even figure out to attempt that, let alone succeed at it, then it most certainly isn't an example of "general" intelligence.
> A human can sit down to play a game with unknown rules and write a spec as he goes.
Some humans can. Many, if not most humans cannot. A significant enough fraction of humans have trouble putting together Ikea furniture that there are memes about its difficulty. You're vastly overestimating the capabilities of the average human. Working in tech puts you in probably the top ~1-5% of capability to intuit and understand rules, but it distorts your intuition of what a "reasonable" baseline for that is.
Yes, I am aware. However, an idealized human can do so. Analogously, there are plenty of humans who can't run an 8-minute mile, but if your bipedal robot is physically incapable of ever doing that, then it isn't reasonable to claim you've achieved human-level athletic performance. When it can compete in every Olympic event, you can claim human-level performance at athletics in general.
If the model can't generalize to arbitrary tasks on its own without any assistance then it doesn't qualify as a general intelligence. AGI to my mind means meeting or exceeding idealized human performance on the vast majority of arbitrary tasks that are cherrypicked to be particularly challenging.
It's not obvious at all, and I would say pretty much impossible without using machine learning. Even for ARC-AGI-1 there is no GOFAI (good old-fashioned AI) program that scores high.
People are still debating whether these models exhibit any kind of intelligence and any kind of thinking. Setting the bar higher than necessary is welcome, but at this point I’m pretty sure everyone’s opinions are set in stone.
In retrospect, it seems obvious that we hit AGI by a reasonable "at least as intelligent as some humans" definition when o3 came out, and everything since then has been goalpost moving by people who have higher and higher bars for which percentile human they would be willing to employ (or consider intellectually capable). People should really just use the term "ASI" when their definition of AGI excludes the majority of humans.
Edit: Here's the guy who coined the term saying we're already there. Everything else is arguing over definitions.
> Well, Lars, I INVENTED THE TERM and I say we have achieved AGI. Current models perform at roughly high-human level in command of language and general knowledge, but work thousands of times faster than us. Still some major deficiencies remain but they're falling fast.
> They all make sense to me if we're trying to judge whether these tools are AGI, no?
As long as the mean and median human scores are clearly communicated, the scoring is fine. I think the human scores above would surprise people at first glance, even if they make sense once you think about it, so there's an argument to be made that scores can be misleading.
“No harness at all” might be an issue, though, as these types of benchmarks are often gamed, and then models perform great on them without actually being better models.