Hacker News

Sorry, are you familiar with what a next token distribution is, mathematically speaking?

If you are not, let me introduce you to the term: a probability distribution.

Just because it has profound properties ... doesn't make it anything other than a probability distribution.
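For concreteness: a next-token distribution is just a softmax over the model's output logits, and it satisfies the defining properties of any probability distribution. A minimal sketch (the vocabulary size and logit values here are made up for illustration):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

# The two defining properties of a probability distribution:
assert all(p >= 0 for p in probs)          # non-negative
assert abs(sum(probs) - 1.0) < 1e-9       # sums to one
```

Whatever emergent behavior the model exhibits, sampling the next token is mathematically just drawing from this categorical distribution.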

> has all the trappings of the stochastic parrot-style HN-discourse that has been consistently wrong for almost a decade now

Perhaps respond to my actual comment rather than to whatever meta-level grouping you wish to interpret it as part of?

> It contains a number of premises that we have no business being confident in. We are potentially witnessing the obviation of human cognitive labor.

What premises? Be clear.



I think they are questioning whether human feedback is even necessary to make progress, i.e. whether the premise that RL needs to be RLHF is true.

My (limited) understanding is that LLMs are not capable of escaping their learned distribution by simply feeding on their own output.
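That limitation can be illustrated with a toy simulation (a made-up categorical distribution standing in for a model, not an actual LLM): if each "generation" is re-fit purely on samples from the previous one, tokens outside the original support can never appear, so the support can only shrink, never grow.

```python
import random

random.seed(0)

def sample_counts(probs, n):
    # Draw n tokens from a categorical distribution and count occurrences.
    counts = [0] * len(probs)
    for _ in range(n):
        r = random.random()
        acc = 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                counts[i] += 1
                break
    return counts

# Initial "model": note the last token has zero probability.
probs = [0.5, 0.3, 0.15, 0.05, 0.0]
for generation in range(20):
    counts = sample_counts(probs, 50)   # small sample = noisy "training set"
    total = sum(counts)
    probs = [c / total for c in counts]  # re-fit on the model's own output

# The zero-probability token never appears in any generation:
# self-generated data cannot take the model out of its own distribution.
assert probs[-1] == 0.0
```

Rare tokens also tend to die out over generations due to finite-sample noise, which is the usual "model collapse" argument in miniature.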

But the question is whether the required external (out of distribution) "stimulus" needs to come from humans.

Could LLMs design experiments/interventions to get feedback from their environment like human scientists would?

I have my doubts that this is possible without an inherent causal reasoning capability, but I'm not sure.



