This is the tech report for a model I helped work on. I'm biased, but it turned out very well.
We essentially let the model learn to retrieve like a human would: make a first search, read the results, and then make another. This makes the model vastly better than pre-programmed pipelines. We test this extensively and compare against implementing the same loop with API models (like Sonnet 4.5 and GPT-5.1). SID-1 compares favorably.
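To make the search, read, search-again loop concrete, here is a toy sketch in plain Python. It is not how SID-1 is implemented: `search`, `iterative_retrieve`, and `next_query` are hypothetical names, the keyword matcher stands in for a real retriever, and the hand-written `follow_up` rule stands in for the decision the model learns to make.

```python
def search(corpus, query):
    """Toy keyword search: return documents sharing at least one word with the query."""
    terms = set(query.lower().split())
    return [doc for doc in corpus if terms & set(doc.lower().split())]


def iterative_retrieve(corpus, first_query, next_query, max_steps=3):
    """Search, read the results, then decide whether to issue a follow-up query."""
    found = []
    query = first_query
    for _ in range(max_steps):
        results = search(corpus, query)
        for doc in results:
            if doc not in found:
                found.append(doc)
        # Assumption: next_query stands in for the model reading the results
        # and choosing the next search (or None to stop).
        query = next_query(query, results)
        if query is None:
            break
    return found


corpus = [
    "alice works at acme",
    "acme is based in berlin",
    "bob likes tea",
]


def follow_up(query, results):
    # Hand-written rule for illustration: once "acme" shows up, search for it.
    if query != "acme" and any("acme" in doc.split() for doc in results):
        return "acme"
    return None


single_pass = search(corpus, "alice")
multi_step = iterative_retrieve(corpus, "alice", follow_up)
```

In this toy case the single search only surfaces the first document, while the two-step loop also reaches the document about where acme is based.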
Happy to answer any questions or get feedback. First and foremost: Enjoy the read. It's much more detailed than most tech reports.
The weirdest thing people do is make up criteria that YC supposedly uses to reject people. There was such a huge diversity in our batch: From 20 y/o to 40+. Foreign, domestic. Credentialed, not credentialed. $1M rev run rate, $0 run rate. Just apply.
The abstract and the rest of the paper don't really match imo. It's not really allocating more to some sequences, but just introducing ~dropout. Might be different sides to the same coin, but was still a weird read.
We spent a fair bit of effort ensuring we were accurate with the language and claims, so we're happy to take any feedback and make updates in subsequent versions. However, I don't see where we claim that MoD allocates more to some sequences and not others (specifically, the abstract says "transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence").
That said, it's a pretty simple change to make the approach work in the way you describe (allocating more to some sequences and not others) by changing the group across which the top-k works. In the paper we use the time (sequence) dimension, but one could also use the batch * time dimension, which would result in asymmetric allocation across sequences.
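A rough sketch of that difference, in plain Python rather than the paper's code. The names `topk_mask`, `route_per_sequence`, and `route_across_batch` and the scores themselves are made up for illustration; the only point is which group the top-k is taken over.

```python
def topk_mask(scores, k):
    """Return a 0/1 mask marking the k highest scores in a flat list."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def route_per_sequence(router_scores, k):
    # Top-k over the time dimension: each sequence gets exactly k routed tokens.
    return [topk_mask(seq, k) for seq in router_scores]


def route_across_batch(router_scores, k):
    # Top-k over the flattened batch * time dimension: the total budget is still
    # batch * k, but sequences may end up with different numbers of routed tokens.
    batch = len(router_scores)
    seq_len = len(router_scores[0])
    flat = [score for seq in router_scores for score in seq]
    mask = topk_mask(flat, batch * k)
    return [mask[b * seq_len:(b + 1) * seq_len] for b in range(batch)]


# Hypothetical router scores for a batch of 2 sequences with 4 tokens each.
router_scores = [
    [0.9, 0.8, 0.7, 0.6],
    [0.1, 0.2, 0.3, 0.4],
]

per_seq = route_per_sequence(router_scores, k=2)
across = route_across_batch(router_scores, k=2)
```

With these scores, the per-sequence version routes two tokens in each sequence, while the batch * time version spends the whole budget on the first sequence.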