How do you know that? What information do you have that would explain your position? We are talking about a specific circumstance and you have brought unsupported generalities to the discussion.
I've had a very similar experience optimising a hidden Markov model prediction tool I work on. I wanted to experiment with an alternative architecture and data structures. Opus 4.7 did the refactor, and eventually the only hot spot became the maths kernel. Over the course of an hour or two it iteratively rewrote that code using all the usual optimisations to improve branching, cache usage, vectorisation, etc. It reviewed the disassembly and the hardware counters with perf to verify that the changes were working as intended. It could have taken me several days to cover that much ground doing low-level optimisations, and I would have spent most of that time grappling with gcc and perf, searching for information about particular SIMD instructions, etc.
The article gives no mention of what exactly was done to achieve the speedup and whether or not the kernel is still able to perform the same function as before.
I’m doubtful this is a meaningful result. The kernel contains a lot of legacy code and generalizations to support different hardware etc.; removing that would result in a speedup. Next are all the mitigations for hardware vulnerabilities and attacks; removing those would give a nice speedup as well, at the cost of security. And finally, just specializing the kernel for whatever the benchmark is measuring would also make it fast, while making it useless as a general piece of software.
The article is talking about "Kernel" as in a low level piece of code to compute math, in this case extended attention for running LLMs on a GPU or accelerator, not as in the Linux Kernel.
> For context: GLM 5.1 ran the same task and reached 7.3x. Kimi K2.6 reached 5x. DeepSeek V4 Pro reached 3.3x. The models that stopped early did so because they issued no tool calls for five consecutive rounds, they concluded they couldn’t make further progress and stopped. Qwen3.7-Max didn’t stop.
By this reasoning I could release a model that lacks all the basic optimisations. Have it optimise itself for hours to reach 20x the throughput and then claim that the model is superior to the others?
I am not saying that is what happened here, but the reporting is abysmal.
It is not the model's job to stop or continue, it's the agent's. Qwen has nothing to do with it.
Right now I've switched to the latest codewhale agent (in Rust), and it would perform much better by his criteria. Much better async IO implementation and orchestration, no more deadlocks as in the typical TypeScript tooling. It just doesn't stop out of the blue, like claude, kimi or opencode do.
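To make that concrete: the stopping rule quoted from the article (no tool calls for five consecutive rounds) is the kind of thing that lives in the harness loop. A rough sketch, not the article's actual agent; `run_agent`, `step`, `idle_limit` and `max_rounds` are all hypothetical names I made up for illustration:

```python
from typing import Callable, List


def run_agent(
    step: Callable[[int], List[str]],
    max_rounds: int = 1000,
    idle_limit: int = 5,
) -> int:
    """Run rounds until `idle_limit` consecutive rounds issue no tool calls.

    `step(round_no)` is a hypothetical callback that asks the model for its
    next action and returns the list of tool calls it issued (possibly empty).
    Returns the number of rounds that were executed.
    """
    idle = 0
    for round_no in range(1, max_rounds + 1):
        tool_calls = step(round_no)
        if tool_calls:
            idle = 0  # any tool call resets the idle counter
        else:
            idle += 1
            if idle >= idle_limit:
                return round_no  # the harness decides to stop here
    return max_rounds


# Fake model: issues a tool call for the first three rounds, then goes quiet.
rounds = run_agent(lambda r: ["compile"] if r <= 3 else [])
print(rounds)
```

Change `idle_limit` or drop the check entirely and the same model "stops early" or "doesn't stop", which is why the comparison says as much about the loop as about the model.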
Obligatory: Either written by AI or by a human who has spent so much time with AI that they adopted its writing style. Anyways.
> Over 35 hours it performed 432 kernel evaluations. Each cycle meant writing code, compiling it, running it, reading the profiling output, deciding what to change, and trying again. The model diagnosed compilation failures it hadn’t seen before, identified performance bottlenecks through runtime feedback rather than prior knowledge, and redesigned the kernel architecture multiple times when incremental improvements stopped working.
Anyone remember genetic algorithms? This might be an improvement, but it still feels a little like deja vu.
Yeah, I remember. I still have Usenet postings about the genetic algorithms conference back in the '90s and some magazine clippings about researchers from the University of Sussex, where I first learned about genetic algorithms back in high school.
Technically birdshot from a shotgun is also randomly distributed (passing through a cone). This actually improves the chance of hitting the clay pigeon, because the birdshot spreads out and each individual ball has a chance to hit.
Genetic algo is similar. It's an optimizer that, in order to avoid local optima, will 'shotgun' an area around its current best guess.
Compare the genetic approach to greedy. With the greedy approach you always take the next lower energy value no matter what, right? But if you end up in a local basin, you'll just never escape, because you never look further than your direct environs.
So instead of just sampling in a close 'circle' around your current point looking for a 'down', how about we spread that out a bit? You could use a 'circle' in a regular pattern, but what does that even look like in high dimensional space? Seems it's best to use some random distribution centered on your current position.
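A toy version of that comparison, as a sketch only. The 1-D landscape `f`, the step size, `sigma`, and the seed are all assumptions picked for illustration; `greedy` looks only at its direct neighbours, `spread` samples from a Gaussian centred on the current position:

```python
import random


def f(x: float) -> float:
    # Toy landscape (assumed): a shallow local minimum near x = 1.13
    # and a deeper global minimum near x = -1.30.
    return x**4 - 3 * x**2 + x


def greedy(x: float, step: float = 0.01) -> float:
    # Look only at the two direct neighbours; move while one of them is lower.
    while True:
        best = min((x - step, x + step), key=f)
        if f(best) >= f(x):
            return x  # no lower neighbour: stuck in whatever basin we are in
        x = best


def spread(x: float, sigma: float = 1.0, iters: int = 2000, seed: int = 0) -> float:
    # Sample candidates from a Gaussian centred on the current point and
    # keep a candidate only when it is lower than where we are now.
    rng = random.Random(seed)
    for _ in range(iters):
        candidate = rng.gauss(x, sigma)
        if f(candidate) < f(x):
            x = candidate
    return x


g = greedy(1.5)
s = spread(1.5)
print(f"greedy: x={g:.2f}, f(x)={f(g):.2f}")
print(f"spread: x={s:.2f}, f(x)={f(s):.2f}")
```

Starting from x = 1.5, the greedy walk settles in the nearby shallow basin, while the wider sampling has a decent chance on every draw of landing in the deeper one.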
(LLMs actually have a 'temperature' setting which introduces noise for this exact reason.)
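For reference, temperature is usually applied by dividing the logits before the softmax. A minimal sketch; the function name and the example logits are made up for illustration:

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.1)   # nearly always picks the top token
hot = softmax_with_temperature(logits, 10.0)   # close to uniform, so more spread
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

Low temperature behaves like the rifle (almost always the single best guess), high temperature like the shotgun.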
Some of GA's claims to fame are: A) it uses purely this distribution to descend; B) it can find multiple optima.
The way I think of it is that the simplest GA is basically greedy optimization with spread.
Greedy is like shooting a rifle, which is great for sniping, but you'll miss if the target is moving fast or doing things you can't quite keep up with.
A GA, like a shotgun, introduces spread: multiple chances to hit, multiple chances to escape local optima and rough patches in the landscape.
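Here is roughly what a very simple GA looks like, as a hedged sketch rather than any canonical implementation. The landscape `f`, population size, averaging crossover, and Gaussian mutation are all assumptions chosen to keep it short:

```python
import random


def f(x: float) -> float:
    # Toy landscape (assumed): local minimum near x = 1.13,
    # deeper global minimum near x = -1.30.
    return x**4 - 3 * x**2 + x


def genetic_minimise(
    pop_size: int = 30,
    generations: int = 60,
    sigma: float = 0.3,
    seed: int = 0,
) -> float:
    rng = random.Random(seed)
    # Start with a population spread over the whole search interval.
    population = [rng.uniform(-3.0, 3.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the better half survives unchanged (elitism).
        population.sort(key=f)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover (average of two parents) plus Gaussian mutation.
            children.append((a + b) / 2 + rng.gauss(0.0, sigma))
        population = parents + children
    return min(population, key=f)


best = genetic_minimise()
print(f"best: x={best:.2f}, f(x)={f(best):.2f}")
```

The many individuals are the pellets: some land in the shallow basin, some in the deep one, and selection keeps whichever did better.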
(A really good, if slightly morbid, modern example in the wild is COVID, which managed to outwit human civilization rather handily. "Not bad for a bit of encapsulated RNA" you'd think, until you realize it was running trillions of attempts in parallel. Really, the poor governments had no chance.)
See the author's Twitter: he speaks English at a rather basic level and certainly did not write this. https://x.com/mohitgeryani/with_replies