I think it's hard to enumerate the unknown, but I'd personally love to see how models like this perform on things like word problems where you introduce red herrings. Right now, LLMs at large tend to struggle mightily to understand when some of the given information is not only irrelevant, but may explicitly serve to distract from the real problem.
That’s not inability to reason though, that’s having a social context.
Humans also don’t tend to operate in a rigorously logical mode and understand that math word problems are an exception where the language may be adversarial: they’re trained for that special context in school. If you tell the LLM that social context, eg that language may be deceptive, their “mistakes” disappear.
What you’re actually measuring is the LLM defaults to assuming you misspoke trying to include relevant information rather than that you were trying to trick it — which is the social context you’d expect when trained on general chat interactions.