Qwen3.7-Max Ran for 35 Hours on Unknown Hardware and Achieved a 10× Speedup

l23k4 · 2026-05-28T06:07:33 1779948453

LLM written.

See the author's Twitter: he speaks English at a rather basic level and certainly did not write this. https://x.com/mohitgeryani/with_replies

geek_at · 2026-05-28T06:13:50 1779948830

https://xcancel.com/mohitgeryani/with_replies

Mashimo · 2026-05-28T06:18:40 1779949120

Also I'm pretty sure the original source was linked here on HN before.

blahblaher · 2026-05-28T07:32:09 1779953529

like 90% of the rest of the internet.

big-chungus4 · 2026-05-28T06:45:39 1779950739

This article was generated from the original Qwen3.7-Max release blog post and contains nothing new: https://qwen.ai/blog?id=qwen3.7

keyle · 2026-05-28T06:09:25 1779948565

I don't doubt that it did it but I wouldn't want to maintain whatever it ended up spewing after 35 hrs.

In my experience, AI fixes problems by mostly adding more code.

It's a short term gain for a long term hurt.

userbinator · 2026-05-28T06:16:10 1779948970

> In my experience, AI fixes problems by mostly adding more code.

In my experience, humans unfortunately tend to do the same.

Iolaum · 2026-05-28T06:28:38 1779949718

The LLMs had to learn that from somewhere :p

Balinares · 2026-05-28T06:51:02 1779951062

Some do, but we should not be level-setting at mediocrity.

bahmboo · 2026-05-28T06:22:06 1779949326

How do you know that? What information do you have that would explain your position? We are talking about a specific circumstance and you have brought unsupported generalities to the discussion.

skew-aberration · 2026-05-28T06:35:20 1779950120

I've had a very similar experience optimising a hidden markov model prediction tool I work on. I wanted to experiment with an alternative architecture and data structures. Opus 4.7 did the refactor, and eventually the only hot spot became the maths kernel. Over the course of an hour or two it iteratively rewrote that code using all the usual optimisations to improve branching, cache usage, vectorisation, etc. It reviewed the disassembly and the hardware counters with perf to verify that the changes were working as intended. It could have taken me several days to cover that much ground doing low level optimisations - and I would have spent most of it grappling with gcc, perf, searching for information about particular SIMD instructions, etc.

zerof1l · 2026-05-28T11:13:27 1779966807

The article gives no mention of what exactly was done to achieve the speedup and whether or not the kernel is still able to perform the same function as before.

I’m doubtful this is a meaningful result. The kernel contains a lot of legacy code and generalizations to support different hardware etc.; removing that would result in a speedup. Next are all the mitigations for hardware vulnerabilities and attacks; removing those would give a nice speedup as well, at the cost of security. And finally, just specializing the kernel for whatever the benchmark is measuring, making it useless as a general piece of software, would also make it fast.

miloignis · 2026-05-28T11:51:15 1779969075

The article is talking about "Kernel" as in a low level piece of code to compute math, in this case extended attention for running LLMs on a GPU or accelerator, not as in the Linux Kernel.

trilogic · 2026-05-28T06:22:52 1779949372

Don't give up on native agents; the best logic will prevail. The open weights will show the real deal.

mannyv · 2026-05-28T05:50:00 1779947400

At this point the models should just start improving themselves.

singingtoday · 2026-05-28T06:12:06 1779948726

Rumor is that Anthropic writes all their code with Claude. So it kind of is.

teravor · 2026-05-28T06:45:14 1779950714

so basically just brute force the kernel.

there are more elegant ways to leverage an LLM, see AlphaEvolve: https://arxiv.org/abs/2506.13131

it's difficult to frame most coding tasks in such a way where you can trivially verify correctness.

mosselman · 2026-05-28T06:35:03 1779950103

What a nonsense, generated article.

> For context: GLM 5.1 ran the same task and reached 7.3x. Kimi K2.6 reached 5x. DeepSeek V4 Pro reached 3.3x. The models that stopped early did so because they issued no tool calls for five consecutive rounds, they concluded they couldn’t make further progress and stopped. Qwen3.7-Max didn’t stop.

By this reasoning I could release a model that lacks all the basic optimisations. Have it optimise itself for hours to reach 20x the throughput and then claim that the model is superior to the others?

I am not saying that is what happened here, but the reporting is abysmal.
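The arithmetic behind this objection can be shown with a small sketch. The runtimes below are hypothetical and are not taken from the article; the point is only that a speedup factor is a ratio against whatever baseline was chosen.

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Speedup factor = baseline runtime / optimized runtime."""
    if baseline_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / optimized_seconds


# Hypothetical numbers: the same final kernel measured against two baselines.
final_runtime = 1.0
tuned_baseline = 10.0   # a reasonably optimised starting point (assumed)
naive_baseline = 20.0   # a starting point missing basic optimisations (assumed)

print(speedup(tuned_baseline, final_runtime))
print(speedup(naive_baseline, final_runtime))
```

The same final runtime reads as a larger "x" when the starting point is slower, which is why the baseline matters when comparing such claims.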

rurban · 2026-05-28T06:50:57 1779951057

It is not the model's job to stop or continue, it's the agent. Qwen has nothing to do with it.
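For illustration, the stop condition quoted upthread (no tool calls for five consecutive rounds) is the kind of check a harness could run outside the model. This is a minimal sketch with hypothetical names; it is not the harness used in the article.

```python
def should_stop(tool_calls_per_round: list[int], patience: int = 5) -> bool:
    """Return True once the last `patience` rounds all issued zero tool calls.

    `tool_calls_per_round` is a hypothetical log: one integer per round,
    counting the tool calls the model issued in that round.
    """
    if len(tool_calls_per_round) < patience:
        return False
    return all(n == 0 for n in tool_calls_per_round[-patience:])


print(should_stop([3, 1, 0, 0, 0, 0, 0]))  # five idle rounds at the end
print(should_stop([0, 0, 0, 0, 2]))        # the last round was active
```

Under this rule the harness decides when to end the run, but whether the model keeps issuing tool calls still depends on the model.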

Right now I've switched to the latest codewhale agent (in Rust), and it would perform much better according to his qualifications. Much better async IO implementation and orchestration, no more deadlocks as in the typical TypeScript tooling. It just doesn't stop out of the blue, as claude, kimi or opencode do.

big-chungus4 · 2026-05-28T06:43:48 1779950628

It optimized the Extend Attention operator in Triton. All models were optimizing the same operator.

hobofan · 2026-05-28T06:40:57 1779950457

They didn't optimize their own kernels and optimize their own runtime, which I think is what you are implying.

yjftsjthsd-h · 2026-05-28T05:47:40 1779947260

Obligatory: Either written by AI or by a human who has spent so much time with AI that they adopted its writing style. Anyways.

> Over 35 hours it performed 432 kernel evaluations. Each cycle meant writing code, compiling it, running it, reading the profiling output, deciding what to change, and trying again. The model diagnosed compilation failures it hadn’t seen before, identified performance bottlenecks through runtime feedback rather than prior knowledge, and redesigned the kernel architecture multiple times when incremental improvements stopped working.

Anyone remember genetic algorithms? This might be an improvement, but it still feels a little like deja vu.
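The cycle in the quote (write, compile, run, read the output, decide, try again) can be sketched as a keep-the-best loop driven by feedback. Everything here is a toy stand-in: `toy_propose` plays the role of the model, `toy_evaluate` plays the role of compiling and benchmarking, and the runtimes are made up.

```python
from typing import Callable


def optimise(propose: Callable[[str, str], str],
             evaluate: Callable[[str], float],
             seed: str,
             budget: int) -> tuple[str, float]:
    """Keep the fastest candidate seen over `budget` evaluations."""
    best, best_time = seed, evaluate(seed)
    feedback = f"baseline: {best_time:.3f}s"
    for _ in range(budget):
        candidate = propose(best, feedback)
        try:
            elapsed = evaluate(candidate)
        except RuntimeError as err:  # stands in for a compile or runtime failure
            feedback = f"failed: {err}"
            continue
        feedback = f"ran in {elapsed:.3f}s"
        if elapsed < best_time:
            best, best_time = candidate, elapsed
    return best, best_time


def toy_propose(best: str, feedback: str) -> str:
    """Hypothetical proposer: tweak incrementally, redesign after a failure."""
    return best + ("+redesign" if feedback.startswith("failed") else "+opt")


def toy_evaluate(source: str) -> float:
    """Hypothetical benchmark: a third incremental tweak fails to compile."""
    n_opt = source.count("+opt")
    if n_opt >= 3:
        raise RuntimeError("does not compile")
    return 8.0 / (1 + n_opt + 2 * source.count("+redesign"))


best, seconds = optimise(toy_propose, toy_evaluate, "kernel", budget=5)
print(best, seconds)
```

The difference from a classic genetic algorithm is in `propose`: here the next candidate depends on the feedback text, whereas a GA would mutate and recombine candidates without reading it.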

thatoneguy · 2026-05-28T06:09:29 1779948569

Yeah, I remember. I still have Usenet postings about the genetic algorithms conference back in the '90s and some magazine clippings about researchers from the University of Sussex, where I first learned about genetic algorithms back in high school.

dist-epoch · 2026-05-28T06:26:05 1779949565

Genetic algorithm is random. This is intelligent evolution. Big difference.

Kim_Bruning · 2026-05-28T06:56:20 1779951380

I got nerd-sniped wrt the genetic algorithm.

Technically birdshot from a shotgun is also randomly distributed (passing through a cone). This actually improves the chance of hitting the clay pigeon, because the birdshot spreads out and each individual ball has a chance to hit.

Genetic algo is similar. It's an optimizer that, in order to avoid local optima, will 'shotgun' an area around its current best guess.

dist-epoch · 2026-05-28T09:39:50 1779961190

Yeah, but you shoot in the direction of the clay pigeon; you don't randomly pick a direction in space to point your gun at.

Kim_Bruning · 2026-05-28T13:27:04 1779974824

Compare the genetic approach to greedy? With the greedy approach you always take the next lower energy value no matter what, right? But if you end up in a local basin, you'll just never escape, because you never look further than your direct environs.

So instead of just sampling in a close 'circle' around your current point looking for a 'down', how about we spread that out a bit? You could use a 'circle' in a regular pattern, but what does that even look like in high dimensional space? Seems it's best to use some random distribution centered on your current position.

(LLMs actually have a 'temperature' setting which introduces noise for this exact reason.)

Some of GA's claims to fame are: A) it uses purely this distribution to descend, and B) it can find multiple optima.

The way I think of it is that the simplest GA is basically greedy optimization with spread.

Greedy is like shooting a rifle, which is great for sniping, but you'll miss if the target is moving fast or doing things you can't quite keep up with.

A GA, like a shotgun, introduces spread: multiple chances to hit, multiple chances to escape local optima and rough patches in the landscape.

(A really good, if slightly morbid, modern example in the wild is COVID, which managed to outwit human civilization rather handily. "Not bad for a bit of encapsulated RNA" you'd think, until you realize it was running trillions of attempts in parallel. Really, the poor governments had no chance.)
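The rifle-versus-shotgun contrast can be made concrete with a toy sketch. The landscape below is hypothetical, and `spread_search` is a mutation-only search with Gaussian spread, not a full GA (no population, no crossover); it only illustrates the "greedy with spread" idea.

```python
import random

# Hypothetical 1-D energy landscape: a local basin at index 2,
# the global minimum at index 9.
ENERGY = [5, 3, 1, 2, 4, 6, 4, 2, 0, -3, -1, 2]


def greedy(start: int) -> int:
    """Step to the lower adjacent neighbour until neither neighbour is lower."""
    pos = start
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(ENERGY)]
        best = min(neighbours, key=ENERGY.__getitem__)
        if ENERGY[best] >= ENERGY[pos]:
            return pos
        pos = best


def spread_search(start: int, rounds: int = 50, samples: int = 8,
                  sigma: float = 3.0, seed: int = 0) -> int:
    """Each round, sample positions from a Gaussian centred on the current
    point and move to the best sample only if it lowers the energy."""
    rng = random.Random(seed)
    pos = start
    for _ in range(rounds):
        candidates = []
        for _ in range(samples):
            p = pos + round(rng.gauss(0.0, sigma))
            candidates.append(min(max(p, 0), len(ENERGY) - 1))  # clamp to range
        best = min(candidates, key=ENERGY.__getitem__)
        if ENERGY[best] < ENERGY[pos]:
            pos = best
    return pos


g = greedy(0)
s = spread_search(0)
print("greedy:", g, ENERGY[g])
print("spread:", s, ENERGY[s])
```

Greedy stops in the first basin it reaches because both adjacent cells are higher. The spread version can land beyond the ridge, at the cost of more evaluations per step.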

qrobit · 2026-05-28T06:58:42 1779951522

Both are non-deterministic, both have some metric to optimise, one is specific and efficient, the other is too broad and very expensive

rurban · 2026-05-28T06:52:49 1779951169

Temperature is the rephrasing of randomness. No difference, just much better matchers.
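For reference, temperature in LLM sampling is usually applied by dividing the logits by a positive constant before the softmax. The sketch below shows that standard formulation with made-up logits; it does not describe any particular model's sampler.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 0.5))  # sharper: mass concentrates on the top token
print(softmax_with_temperature(logits, 2.0))  # flatter: more spread across tokens
```

A low temperature approaches always picking the top candidate, while a high temperature approaches a uniform draw, so it controls how much of the "shotgun spread" discussed above is present.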

greenavocado · 2026-05-28T05:50:38 1779947438

Key word is non-differentiable optimization. That's what genetic algorithms were traditionally good at.

Zardoz84 · 2026-05-28T06:00:24 1779948024

So an LLM wrote 432 kernel variations and found which one was the fastest...