In kernel methods the kernel is typically given, and things like positional embeddings, layer normalization, causal masking, and so on are missing. Kernel methods did not take off partly due to their computational complexity (quadratic in sample size), while transformers did precisely because they were parallelizable, and thus computationally efficient, compared with the RNNs and LSTMs that came before them.
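The parallel can be made concrete: both a kernel method's Gram matrix and an attention layer's score matrix are n × n pairwise-similarity computations, quadratic in the number of points. Below is a minimal NumPy sketch of that correspondence (the RBF kernel, the dimensions, and the random inputs are illustrative assumptions, not taken from any particular model):

```python
import numpy as np

n, d = 256, 64  # number of points / sequence length, feature dimension (illustrative)
rng = np.random.default_rng(0)

# Kernel method: Gram matrix K[i, j] = k(x_i, x_j). The kernel (here an RBF
# with gamma = 1) is fixed in advance, and the matrix is O(n^2) in sample size.
X = rng.standard_normal((n, d))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
gram = np.exp(-sq_dists)  # shape (n, n)

# Attention: the score matrix is also an n x n similarity matrix, but it is
# built from learned projections and computed as dense matrix multiplies,
# which parallelize well on modern hardware.
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
scores = Q @ K.T / np.sqrt(d)  # shape (n, n) -- quadratic, like the Gram matrix
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ V  # shape (n, d)

print(gram.shape, scores.shape, out.shape)
```

The shared quadratic cost is visible in the two (n, n) matrices; the difference the passage points to is that attention's version is a handful of dense matmuls over learned projections rather than a fixed kernel evaluated pairwise.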
Reductions of one architecture to another are usually more enlightening from a theoretical perspective than a practical one.