
ELI5 is tricky as details have to be sacrificed, but I'll try.

An attention mechanism is one where you ask a neural network to learn how much attention to allocate to each item in a sequence, i.e. to learn which items should be looked at.

A transformer is built on a self-attention mechanism: you ask the neural network to 'transform' each element by looking at its potential combination with every other element, using this (learnable, trainable) attention function to decide which combination(s) to apply.
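To make that concrete, here's a minimal sketch of one self-attention step in NumPy. The names (`self_attention`, `Wq`, `Wk`, `Wv`) are illustrative, not from any library; this is the standard scaled dot-product form, stripped of multi-head machinery:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) matrix of n sequence elements; Wq/Wk/Wv: learned (d, d) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # One score per (element, element) pair: everything vs everything.
    scores = Q @ K.T / np.sqrt(K.shape[1])          # shape (n, n)
    # Softmax over each row turns scores into attention weights summing to 1.
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Each output element is an attention-weighted mix of every element's value.
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                 # same shape as X: (5, 8)
```

The 'learnable attention function' the comment describes is exactly the `Wq`/`Wk` projections: training nudges them so the score for a useful pair of elements comes out high.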

And it turns out that this very general mechanism, although compute-intensive (it considers everything combining with everything, so complexity is quadratic in sequence length) and data-intensive (it has lots and lots of parameters, so it needs huge amounts of data to be useful), can actually represent many of the things we care about, in a form that can be trained with the deep learning algorithms we already had.
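The quadratic cost is easy to see directly: self-attention materialises one score per (query, key) pair, so doubling the sequence length quadruples the score matrix. A tiny illustration:

```python
import numpy as np

# One score per (query, key) pair, so the score matrix is n x n.
sizes = {}
for n in (128, 256, 512):
    scores = np.zeros((n, n))   # what self-attention has to compute and store
    sizes[n] = scores.size

# Each doubling of n multiplies the work by 4.
ratio = sizes[256] / sizes[128]   # 4.0
```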

And, really, those are the two big things ML needs: a model structure with some configuration of parameters that can actually represent the thing you want to calculate, and the ability to actually find that configuration from training data in a reasonable amount of time.


