
Everything DeepMind published at this year's ICML would be a good start.

Transformers (or rather the QKV attention mechanism) have taken over ML research at this point; they just scale and work in places they really shouldn't. E.g. you'd think convnets would make more sense for vision because of their translation invariance, but ViT works better even without this inductive bias.
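For concreteness, here's a minimal sketch of the scaled dot-product (QKV) attention these models are built on. The shapes, names, and toy projection matrices are purely illustrative, not taken from any particular implementation:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(q, k, v):
        """q, k, v: (seq_len, d) arrays; returns (seq_len, d)."""
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)       # similarity of every query to every key
        weights = softmax(scores, axis=-1)  # each token attends to all tokens
        return weights @ v                  # weighted sum of values

    # Toy usage: 4 tokens (e.g. ViT image patches) with 8-dim embeddings.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    out = attention(x @ Wq, x @ Wk, x @ Wv)
    print(out.shape)   # (4, 8)

Note that nothing in this computation knows the tokens came from a 2D image grid; that's the inductive bias ViT gives up and has to learn from data instead.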

Even in things like diffusion models, attention layers are crucial to making the model work.


