
Yeah, this part is what makes the high performance even more surprising to me. The fact that LLMs do so well on visual tasks (also seen in their ability to draw an image purely through textual output: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/) implies not only that they have some "world model," but that they have one despite the disadvantage of fitting a round peg into a square hole. It's like trying to map out the entire world using the orderly left brain, without a more holistic, spatial right brain.
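To make "drawing purely through textual output" concrete: an SVG image is just markup, so a model that can only emit tokens can still "draw" by writing the markup directly. A minimal sketch (the shapes here are placeholders I made up, not output from the linked benchmark):

```python
# A bare-bones SVG "drawing" built purely from text: a body on two wheels.
# Anything that can emit this string can produce an image without ever
# touching a pixel buffer -- the idea behind the pelican-on-a-bicycle test.
def simple_svg() -> str:
    shapes = [
        '<circle cx="35" cy="80" r="15" stroke="black" fill="none"/>',  # rear wheel
        '<circle cx="85" cy="80" r="15" stroke="black" fill="none"/>',  # front wheel
        '<ellipse cx="60" cy="50" rx="18" ry="12" fill="gray"/>',       # body
    ]
    return ('<svg xmlns="http://www.w3.org/2000/svg" width="120" height="100">'
            + "".join(shapes) + "</svg>")

print(simple_svg())
```

The point is that the "image" lives entirely in a 1D token stream; any spatial reasoning (wheel placement, proportions) has to happen in the model's head before serialization.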

I wonder if anyone has experimented with having some sort of "visual" scratchpad instead of the "text-based" scratchpad that CoT uses.



A file is a stream of symbols encoded as bits according to some format. It's pretty much 1D. It would be surprising if an LLM couldn't extract information from a file or a data stream.
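To illustrate the "1D stream" point: even a binary image file is, at the byte level, just a flat sequence of symbols, and "extracting information" means parsing that sequence according to the format's rules. A small sketch using PNG as the example format (PNG stores width and height big-endian at fixed offsets in its IHDR chunk):

```python
import struct

# A PNG file is: 8-byte signature, then chunks. The first chunk (IHDR)
# begins with a 4-byte length and the type "IHDR", followed by a 4-byte
# width and 4-byte height -- so the dimensions sit at byte offsets 16-24.
def png_dimensions(data: bytes) -> tuple[int, int]:
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG stream"
    width, height = struct.unpack(">II", data[16:24])
    return width, height

# Minimal hand-built header for demonstration (a real file has more chunks).
header = (b"\x89PNG\r\n\x1a\n"          # signature
          + b"\x00\x00\x00\rIHDR"       # IHDR chunk length + type
          + struct.pack(">II", 120, 100))  # width, height
print(png_dimensions(header))
```

Nothing two-dimensional ever appears in the parsing itself; the 2D structure exists only in how a renderer later interprets the decoded values.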




