
How can there be "private" tasks when you have to use the OpenAI API to run queries? OpenAI sees everything.


We worked with ARC to run inference on the semi-private tasks last week, after o3 was trained, using an inference-only API that was sent the prompts but not the answers, and that did no durable logging.


What's your opinion on the veracity of this benchmark, given that o3 was fine-tuned and the other models were not? Can you give more details on how much data was used to fine-tune o3? It's hard to put the results into perspective given this confounder.


I can’t provide more information than is currently public, but from the ARC post you’ll note that we trained on about 75% of the train set (which contains 400 examples total), which is within the ARC rules, and evaluated on the semi-private set.
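To make those numbers concrete: 75% of the 400-example train set is 300 examples used for training, with 100 held out. A minimal sketch of that kind of split is below; the function name, seed, and shuffling are purely illustrative, since neither ARC's task files nor OpenAI's actual tooling are described in the thread.

```python
import random

def split_train_set(n_examples=400, train_frac=0.75, seed=0):
    """Split example indices into a training subset and a held-out remainder.

    Illustrative only: this is not ARC's or OpenAI's actual pipeline,
    just the arithmetic of "about 75% of 400 examples".
    """
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    cut = int(n_examples * train_frac)
    return indices[:cut], indices[cut:]

train_idx, held_out_idx = split_train_set()
print(len(train_idx), len(held_out_idx))  # 300 100
```

Note that this split concerns only the public train set; the semi-private evaluation tasks are a separate pool the model was not trained on.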


That's completely understandable - leveraging the train set is allowed. But my point is that the comparison is against models that were actually zero-shot and not tuned. It isn't apples to apples; it's apples to orchards.



