
It works like this:

First, use a tokenizer dictionary to convert the input text into a sequence of token numbers (up to 2048 tokens, each drawn from 50257 possible token values, in GPT-3). For each token, form a one-hot vector (1 at the token's index, 0 elsewhere), transform it with a learned "embedding" matrix (50257x12288 in GPT-3), and add a positional-encoding vector built from sine and cosine functions with several different periodicities.
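A minimal numpy sketch of that embedding step, with toy dimensions and a random (rather than learned) embedding matrix. This uses the sinusoidal positional scheme described above; note that GPT-3 itself actually learns its position embeddings.

```python
import numpy as np

# Toy dimensions (GPT-3: vocab 50257, d_model 12288, context 2048).
vocab_size, d_model = 1000, 64
rng = np.random.default_rng(0)

# Learned embedding matrix -- random here, just for shape.
W_embed = rng.normal(0, 0.02, (vocab_size, d_model))

def positional_encoding(n_pos, d):
    # Sines and cosines with geometrically spaced periods.
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def embed(token_ids):
    # One-hot vector times W_embed is just a row lookup.
    x = W_embed[token_ids]
    return x + positional_encoding(len(token_ids), d_model)

tokens = [5, 42, 7]   # token ids from the tokenizer dictionary
x = embed(tokens)     # shape (3, 64): one d_model vector per token
```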

Then, for each layer and each attention head (96 layers and 96 heads per layer in GPT-3), transform the input vector by query, key and value matrices (12288x128 in GPT-3) to obtain a query, key and value vector for each token. For each token, compute the dot product of its query vector with the key vectors of all previous tokens, scale by 1/sqrt of the vector dimension, and normalize the results so they sum to 1 using softmax (i.e. applying e^x and dividing by the sum); these are the attention coefficients. Then compute the attention head output by summing the value vectors of previous tokens weighted by the attention coefficients. Finally, for each token, concatenate the outputs of all attention heads in the layer (each with its own learned query/key/value matrices), add the layer input (a residual connection), and normalize (normalizing means the vector values are shifted and scaled so they have mean 0 and variance 1).
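A sketch of a single attention head with causal masking, again with toy dimensions and random matrices in place of learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 64, 16, 5   # GPT-3: 12288, 128, up to 2048

X = rng.normal(size=(n_tokens, d_model))      # layer input, one row per token
W_q = rng.normal(0, 0.02, (d_model, d_head))  # learned per-head matrices
W_k = rng.normal(0, 0.02, (d_model, d_head))
W_v = rng.normal(0, 0.02, (d_model, d_head))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product scores; mask out future tokens so each token
# only attends to itself and previous tokens (causal attention).
scores = Q @ K.T / np.sqrt(d_head)
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax: attention coefficients sum to 1 over visible tokens.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

head_out = weights @ V   # (n_tokens, d_head), one head's output
```

In the full model, the 96 `head_out` blocks are concatenated back to a 12288-wide vector per token before the residual add and normalization.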

Next, for the feedforward layer, apply a learned matrix, add a learned bias vector, and apply a nonlinearity (a ReLU is f(x) = max(0, x); the variant with f(x) = kx, k near 0, for negative x is a leaky ReLU, and GPT-3 itself uses GELU, a smooth relative); then apply a second learned matrix and bias (12288x49152 and 49152x12288 matrices in GPT-3; these feedforward matrices account for roughly two thirds of the parameters in GPT-3). Then add the input from before the feedforward layer and normalize.
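The feedforward block with its residual connection and normalization, sketched with a plain ReLU and without the learned gain/bias of layer norm for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 64, 256, 5   # GPT-3: 12288 and 49152 (4x wider)

x = rng.normal(size=(n_tokens, d_model))   # output of the attention block
W1, b1 = rng.normal(0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.02, (d_ff, d_model)), np.zeros(d_model)

def layer_norm(v, eps=1e-5):
    # Shift and scale each token vector to mean 0, variance 1.
    return (v - v.mean(-1, keepdims=True)) / np.sqrt(v.var(-1, keepdims=True) + eps)

h = np.maximum(0, x @ W1 + b1)        # expand to d_ff and apply ReLU
out = layer_norm(x + (h @ W2 + b2))   # project back, add residual, normalize
```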

Repeat the process for each layer, each with its own matrices, passing the output of the previous layer as input. Finally, multiply by the transpose of the initial embedding matrix and use softmax to get, for each position, a probability distribution over the next token. For training, adjust the network's parameters so those distributions put high probability on the actual next token in the text. For inference, sample the next token from the distribution, typically restricted to the top K tokens or those above a probability cutoff, append it to the input, and repeat the whole thing until an end-of-text token is generated.
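The final projection and a simple top-K sampler can be sketched like this, reusing the embedding matrix's transpose for the output projection (weight tying); K=40 here is an arbitrary illustrative choice, not a GPT-3 setting:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64
W_embed = rng.normal(0, 0.02, (vocab_size, d_model))  # stands in for the learned matrix

def next_token_probs(h_last):
    # Project the final hidden state to vocabulary logits via the
    # transpose of the embedding matrix, then softmax to probabilities.
    logits = h_last @ W_embed.T
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_top_k(probs, k=40):
    # Keep the k most probable tokens, renormalize, and sample one.
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

h_last = rng.normal(size=d_model)   # last layer's output at the final position
probs = next_token_probs(h_last)
tok = sample_top_k(probs)           # the generated next-token id
```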


