I’m skeptical that it will ever be possible to tell a hallucination from a “fact” by looking at what’s going on inside the model, because hallucinations aren’t really a bug in the usual sense. I.e., there’s no faulty logic and nothing is misfiring.
Instead, it’s more appropriate to think of LLMs as always hallucinating. Sometimes the output comes really close to reality because there’s a lot of reinforcement in the training data. Sometimes we humans infer meaning that isn’t there, because that’s how humans work. And sometimes the leaps show up clearly as “hallucinations” because the patterns the model is expressing don’t match the patterns that are meaningful to us. (E.g., when they hallucinate strongly patterned things like URLs or academic citations that don’t actually point to anything real. The model has picked up the pattern of what such citations look like really well, but it didn’t and can’t make the leap to linking those patterns to reality.)
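A toy way to see the citation case, as a minimal sketch and not a claim about how real LLMs work internally: the hypothetical `learn_pattern` below keeps only the shape of a few made-up citations (which values appeared in each slot), and `generate` recombines them. Every name and all the sample data here are invented for illustration.

```python
from itertools import product

# Made-up "training data": (author, year, venue) triples. Purely illustrative.
TRAINING = [
    ("Smith", "2019", "Journal of Examples"),
    ("Jones", "2021", "Toy Systems Review"),
    ("Smith", "2021", "Toy Systems Review"),
]

def learn_pattern(examples):
    """Record which values were seen in each slot; forget which combinations were real."""
    return [sorted(set(column)) for column in zip(*examples)]

def generate(slots):
    """Emit every citation that fits the learned pattern."""
    return list(product(*slots))

def format_citation(citation):
    author, year, venue = citation
    return f"{author} ({year}). {venue}."

pattern = learn_pattern(TRAINING)
generated = generate(pattern)
real = [c for c in generated if c in TRAINING]
made_up = [c for c in generated if c not in TRAINING]

for c in generated:
    label = "in training data" if c in TRAINING else "NOT in training data"
    print(f"{format_citation(c)}  <- {label}")
```

Every line of output comes from the same procedure. Nothing inside `generate` marks the recombined citations as different from the ones that were actually seen; you only find out by checking against something outside the model.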
Not to mention that in a lot of use cases for LLMs we actually want “hallucination”, e.g. when we ask it to do any creative task or make up stories, jokes, songs, or pictures. It’s only a hallucination in the wrong context, and context is the main thing LLMs just don’t have.
> Instead, it’s more appropriate to think of LLMs as always hallucinating.
That matches my mental model as well. To get rid of hallucinations, "I don't know" would have to be an acceptable answer, and the model would have to output it when 'appropriate'... which it doesn't know (and to be fair, neither do we most of the time, without some way of checking/validating).
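To make that concrete, here's a rough sketch of what "say 'I don't know' when appropriate" looks like if the only signal available is the model's own confidence. `answer_or_abstain`, the threshold, and the probability tables are all hypothetical, not taken from any real model or library.

```python
def answer_or_abstain(candidates, threshold=0.7):
    """Return the most probable answer, or "I don't know" if confidence is below threshold.

    `candidates` maps each candidate answer to the probability that the
    (hypothetical) model assigns to it.
    """
    best, confidence = max(candidates.items(), key=lambda item: item[1])
    return best if confidence >= threshold else "I don't know"

# Invented numbers for illustration only.
well_covered = {"Paris": 0.95, "Lyon": 0.05}            # confident and right
unsure = {"1887": 0.40, "1889": 0.35, "1901": 0.25}     # unsure, so it abstains
confidently_wrong = {"Sydney": 0.85, "Canberra": 0.15}  # confident and wrong

print(answer_or_abstain(well_covered))
print(answer_or_abstain(unsure))
print(answer_or_abstain(confidently_wrong))
```

The threshold handles the case where the model is unsure, but the third case sails straight through. Confidence reflects how strongly a pattern was reinforced, not whether it's true, so you still need some external way of checking.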