
Likely not.

We’re seeing a massive slowdown in the value gained from all that additional training. Folks don’t like to talk about that, but absent a completely new breakthrough, the current math of LLMs has largely run its course.

We simply don’t need massive training runs forever. We’re getting to the point where “good enough” models will solve most use cases. And the demonstrated business value is still broadly missing at the level required to keep funding all this training for much longer.

I dunno, I thought that for a while too, but there are a lot of new ideas in architecture that may warrant massive training runs. Mamba and state space models are pretty interesting, but they haven’t had their transformer moment yet because I haven’t really seen anyone go for broke training one on a huge dataset at a huge model size. Some of the more fundamental changes, like Kolmogorov–Arnold Networks or the ideas behind continuous backpropagation, haven’t really had the opportunity to be pushed to the limit either. I think it’s still early days on what these models can do. And I say this as someone who bought a Mac M3 Max with 128 GB of RAM on the hope that on-device training and inference work would eventually move locally. It’s encouraging to see the progress, and I hope it does.

> but there are a lot of new ideas in terms of architecture that may warrant massive training runs

I don't think the argument is that that isn't true; it's that the gains from those massive training runs are diminishing. Eventually it won't be worth doing a full run for each new idea; you'll have to bundle several together to get any noticeable change.
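To make the diminishing-returns point concrete, here's a quick sketch using a Chinchilla-style scaling law, L(N, D) = E + A/N^α + B/D^β. The constants below roughly match the published Chinchilla fit, but treat the whole thing as illustrative, not a prediction for any particular model:

```python
# Chinchilla-style scaling law: loss as a function of params (N) and tokens (D).
# Constants approximate the Hoffmann et al. fit; illustrative only.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Double params and data repeatedly; each doubling buys less than the last.
n, d = 1e9, 2e10
prev = loss(n, d)
for _ in range(5):
    n, d = 2 * n, 2 * d
    cur = loss(n, d)
    print(f"{n:.0e} params: loss {cur:.4f} (improvement {prev - cur:.4f})")
    prev = cur
```

Each doubling shrinks the power-law terms by a fixed factor (2^-0.34 and 2^-0.28), so the absolute loss improvement decays geometrically toward the irreducible term E.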


Same here. Then you see SOTA in a browser from Ex0byt, online 10x training (JIT-Lora), TurboQuant (Google), etc. Just saw KV prediction mentioned in this thread, so looking into that too.

I'm adapting all of this to Rust+WGPU with compute shaders if you want to follow along.

See this repo: https://github.com/tmzt/shady-thinker

Goal is Qwen3.5 27b on a Pixel 10 Pro running GrapheneOS.



