
Here are the results for base models[1]:

  Model             Semi-private eval  Public eval
  o3 (coming soon)  75.7%              82.8%
  o1-preview        18%                21%
  Claude 3.5 Sonnet 14%                21%
  GPT-4o            5%                 9%
  Gemini 1.5        4.5%               8%

[1]: https://arcprize.org/2024-results



It's easy to miss, but the first sentence of the announcement mentions that they used a version of o3 trained on a public ARC-AGI dataset, so technically it doesn't belong on this list.


It's all a scam. ClosedAI trained on the data they were tested on, so no, nothing here is impressive.


Just a clarification: they tuned on the public training dataset, not the semi-private one. The 87.5% score was on the semi-private eval, which means the model was still able to generalize well.

That being said, the fact that this is not a "raw" base model, but one tuned on the ARC-AGI test distribution, takes away from the impressiveness of the result. How much? I'm not sure; we'd need the score of the un-tuned base o3 model for that.

In the meantime, comparing this tuned o3 model to other un-tuned base models is unfair (an apples-to-oranges comparison).


Did they definitely do it, or just probably? Is there any source for that, so I can point it out to people?


I'd love to know how Claude 3.5 Sonnet does so well despite (presumably) not having the same tricks as the o-series models.



