Not a robotics guy, but to the extent that the same fundamentals hold:
I think it's a degrees-of-freedom question. Given the (relatively) low conditional entropy of natural language, there aren't actually that many true degrees of freedom. The real world, on the other hand, has massively more degrees of freedom, both in general (3 dimensions, up to 6 degrees of freedom per joint, M joints, continuous vs. discrete space, etc.) and also given the path dependence of actions, the non-standardized nature of actuators, kinematics, etc.
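A toy sketch of what "low conditional entropy" means here (my illustration, not from the comment): estimate the entropy of the next character given the previous one from bigram counts. Uniformly random lowercase letters would come out near log2(26) ≈ 4.7 bits per character; structured text comes out far lower.

```python
# Toy estimate of H(next char | previous char) from bigram counts.
import math
from collections import Counter

def bigram_conditional_entropy(text: str) -> float:
    """Conditional entropy, in bits per character, of a bigram model."""
    pairs = Counter(zip(text, text[1:]))   # (prev, next) counts
    prevs = Counter(text[:-1])             # prev counts
    total = sum(pairs.values())
    h = 0.0
    for (a, b), c in pairs.items():
        p_pair = c / total                 # P(prev, next)
        p_cond = c / prevs[a]              # P(next | prev)
        h -= p_pair * math.log2(p_cond)
    return h

# Highly structured text comes out at a fraction of a bit per character,
# versus ~4.7 bits for uniformly random lowercase letters.
print(bigram_conditional_entropy("the quick brown fox " * 50))
```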
All in, you get crushed by the curse of dimensionality: with N true degrees of freedom, you need O(exp(N)) data points to achieve comparable performance. Folks do a bunch of clever things to tame that explosion, but I think the (admittedly reductionist) point stands: although the real world is theoretically verifiable (and could in principle produce infinite data), in practice we currently have exponentially less real-world data for an exponentially harder problem.
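A back-of-the-envelope illustration of that O(exp(N)) claim (my numbers, not the commenter's): covering each degree of freedom at a fixed resolution of k bins takes k**N cells, so modest increases in N blow the budget immediately.

```python
def samples_for_grid(n_dims: int, bins_per_dim: int = 10) -> int:
    """Cells needed to cover an n-dimensional space at fixed resolution."""
    return bins_per_dim ** n_dims

# A language-like problem with a handful of effective degrees of freedom:
print(samples_for_grid(3))    # 1,000 cells
# A robot-like problem, e.g. ~7 joints x (position, velocity) = 14 dims:
print(samples_for_grid(14))   # 100 trillion cells
```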
This understates the complexity of the problem. I have built a career modeling/learning entity behavior in the physical world at scale. Language is almost a trivial case by comparison.
Even the existence of most relationships in the physical world can only be inferred, never mind their dimensionality. The correlations are often too weak to resolve unless you can work with data sets that far exceed the entire corpus of human text, and sometimes not even then. Language has relatively unambiguous structure; that simply isn't the norm for real space-time data models. In some cases we can't unambiguously resolve causality or temporal ordering in the physical world. Human brains aren't fussed by this.
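One way to make "weak correlations need enormous datasets" concrete (a standard Fisher z-transform power calculation, added by me): the sample size needed just to detect a correlation of magnitude r at conventional significance and power grows roughly like 1/atanh(r)**2, so halving the signal strength roughly quadruples the data requirement.

```python
import math

def n_to_detect_correlation(r: float, z_alpha: float = 1.96,
                            z_beta: float = 0.84) -> int:
    """Approximate sample size to detect a correlation of magnitude r
    (two-sided alpha = 0.05, power = 0.80) via Fisher's z-transform."""
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

print(n_to_detect_correlation(0.5))    # strong signal: a few dozen samples
print(n_to_detect_correlation(0.01))   # weak signal: tens of thousands
```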
There is a powerful litmus test for what "AI" can do. Theoretically, indexing and learning are equivalent problems, and there are many practical data models for which no scalable indexing algorithm exists in the literature. These overlap almost perfectly with the data models that current AI tech is demonstrably incapable of learning. A company with novel AI tech that can learn a hard data model could offer a zero-knowledge proof of capability by qualitatively improving indexing performance on those data models at scale.
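A small illustration of why indexing gets hard in the same regimes where learning does (my sketch, not the commenter's litmus test): in high dimensions, distances concentrate, so the nearest and farthest neighbors of a query become nearly equidistant and the pruning that makes spatial indexes like k-d trees scalable buys almost nothing.

```python
import math
import random

def distance_contrast(n_dims: int, n_points: int = 2000, seed: int = 0) -> float:
    """Relative spread (max - min) / min of distances from the origin
    to uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    dists = [
        math.sqrt(sum(rng.random() ** 2 for _ in range(n_dims)))
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

print(distance_contrast(2))     # big contrast: pruning an index pays off
print(distance_contrast(500))   # tiny contrast: the index degenerates to a scan
```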
Synthetic "world models" so thoroughly nerf the computer science problem that they won't translate to anything real.
But we don't need to know all the things that could happen if M joints moved in every possible way at the same time. We operate within normal constraints. When you see someone trip on a sidewalk and recover before falling on their face, that's still a physical system taking signals and suggesting corrections that could be simulated in a relatively straightforward Newtonian virtual reality, and trained on billions of times with however many virtual joints and actuators.
In terms of "world building", it makes sense for the "world" to not be dreamed up by an AI, but to have hard deterministic limits to bump up against in training.
I guess what I mean is that humans in the world constantly face a lot of conditions that can lead to undefined behavior as well, but 99% of the time not falling on your face is good enough to get you a job washing dishes.
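The trip-and-recover loop above can be caricatured as feedback control in a minimal Newtonian simulation (a toy sketch; all constants and gains below are made up, with unit mass and length so torque maps directly to angular acceleration): an inverted pendulum pushed off balance, with a PD "reflex" torque pulling it back upright.

```python
import math

def simulate_recovery(theta0: float, kp: float = 40.0, kd: float = 8.0,
                      dt: float = 0.01, steps: int = 500) -> float:
    """Absolute final lean angle (radians) after a push of theta0 radians."""
    g, length = 9.81, 1.0        # gravity, effective "body" length
    theta, omega = theta0, 0.0   # lean angle and angular velocity
    for _ in range(steps):
        torque = -kp * theta - kd * omega                 # corrective reflex
        alpha = (g / length) * math.sin(theta) + torque   # angular acceleration
        omega += alpha * dt      # semi-implicit Euler integration
        theta += omega * dt
    return abs(theta)

print(simulate_recovery(0.3))   # a 0.3 rad push decays back toward upright
```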
Also not a robotics guy, but that all sounds right to me...
What I do have deep experience in is market abstractions and jobs to be done theory. There are so many ways to describe intent, and it's extremely hard to describe intent precisely. So in addition to all the dimensions you brought up that relate to physical space, there is also the hard problem of mapping user intent to action with minimal "error", especially since the errors can have big consequences in the physical world. In other words, the "intent space" also has many dimensions to it, far beyond what LLMs can currently handle.
On one end of the spectrum of consequences, the robot loads my dishwasher with so much overlap that a bunch of the dishes don't get cleaned (what I really want is for the dishes to be clean, not merely in the dishwasher); on the other end, the robot overpowers humanity and turns the universe into paperclips.
So maybe we have to master LLMs and probably a whole other paradigm before robots can really be general purpose and useful.
As far as I can see, classic methods (the ones used to teach children) could create at least an order of magnitude more data than we have now, just by paraphrasing text (classic NLP), but it depends on the language (I'll try to explain).
Text really does have a lot of degrees of freedom, but it depends on the language, and even more on the type of alphabet. Modern English, with its phonetic alphabet, is the worst choice because it is the simplest: almost nobody uses second or third hidden meanings (I've heard estimates from 2-3 up to 5-6 meanings, depending on the source). Hieroglyphic languages are much more information-rich (10-22 meanings). Interestingly, phonetic languages in totalitarian countries (like Russian) are also much richer (8-12 meanings), because speakers got used to hiding some meanings from the government to avoid punishment.
Language differences (more dimensions) could explain China's current achievements surpassing the West's, and could also hint at how to boost Western achievements: use more scientists from Eastern Europe and pay more attention to Eastern European languages.
For 3D robots, I see only one way: a computationally simulated environment.
Even though the system rules and I/O are tightly constrained, autonomous vehicles are still struggling to match human performance in an open-world scenario, after a gigantic R&D investment with a crystal-clear path to a return.
Fifteen years ago I thought that'd be a robustly solved problem by now. It's getting there, but I think I'll still need to invest in driving lessons for my teenage kids. Which is pretty annoying, honestly: expensive, dangerous for a newly qualified driver, and a massive waste of time that could be used for better things. (OK, track days and mountain passes are fun. 99% of driving is just boring, unnecessary suckage).
What's notable: AVs have vastly better sensors than humans, masses of compute, potentially 10X reaction speed. What they struggle with is nuance and complexity.
Also, AVs don't have to solve the exact same problems as a human driver. For example, parking lots: they don't need to figure out echelon parking or multi-storey lots, they can drop their passengers and drive somewhere else further away to park.
> in practice we currently have exponentially less real-world data for an exponentially harder problem
Is that where learning comes in? Any actual AGI machine will be able to learn. We should be able to buy a robot that comes ready to learn and we teach it all the things we want it to do. That might mean a lot of broken dishes at first, but it's about what you would expect if you were to ask a toddler to load your dishes into the dishwasher.
My personal bar for when we reach actual AGI is when it can be put in a robot body that can navigate our world, understand spatial relationships, and can learn from ordinary people.
Real roboticists should chime in...