The first phase is referred to as "prefill", where the input is processed to build the KV cache.
After that, the "decode" phase runs auto-regressively; each decode pass yields one new token.
This post on [Inference Memory Requirements](https://huggingface.co/blog/llama31#inference-memory-require...) is quite good.
These two phases have pretty different performance characteristics - prefill can max out GPU memory. For long contexts, it can be nigh impossible to do it all in a single pass - frameworks like vLLM use a technique called "chunked prefill".
The decode phase is compute intensive, but tends not to max out GPU memory.
If you are serving these models, you really want to be able to run larger batch sizes during inference, which only really comes with scale - for a smaller app, you won't want to make the user wait that long.
So, a long context only has to be processed _once_ per inference, which is basically a scheduling problem.
But the number of decode passes scales linearly with the output length. If output length were unlimited, some requests could end up _always_ present in an inference batch, reducing throughput for everyone.
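To make chunked prefill concrete, here's a minimal toy sketch (single head, no batching, plain NumPy; the function names and chunk size are made up for illustration and are not vLLM's actual API). The prompt is processed a chunk at a time and each chunk's keys/values are appended to the KV cache, so the attention-score working set depends on the chunk size rather than the full prompt length.

```python
import numpy as np

D = 64        # head dimension (toy value)
CHUNK = 256   # prefill chunk size (toy value)

def attend(q, k, v, causal_offset):
    # q: (m, D) queries for the current chunk; k, v: (n, D) all cached keys/values so far
    scores = q @ k.T / np.sqrt(D)                                  # (m, n)
    m, n = scores.shape
    # causal mask: query at global position causal_offset + i may only see keys 0..causal_offset + i
    mask = np.arange(n)[None, :] > (causal_offset + np.arange(m))[:, None]
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                                   # (m, D)

def chunked_prefill(prompt_q, prompt_k, prompt_v):
    """Process a long prompt CHUNK tokens at a time, growing the KV cache as we go."""
    k_cache, v_cache = np.zeros((0, D)), np.zeros((0, D))
    outputs = []
    for start in range(0, len(prompt_q), CHUNK):
        q = prompt_q[start:start + CHUNK]
        k_cache = np.concatenate([k_cache, prompt_k[start:start + CHUNK]])
        v_cache = np.concatenate([v_cache, prompt_v[start:start + CHUNK]])
        outputs.append(attend(q, k_cache, v_cache, causal_offset=start))
    return np.concatenate(outputs), k_cache, v_cache   # cache is then reused for decode

T = 1000  # toy prompt length
q, k, v = (np.random.randn(T, D) for _ in range(3))
out, k_cache, v_cache = chunked_prefill(q, k, v)
print(out.shape, k_cache.shape)   # (1000, 64) (1000, 64)
```

In a real server the same loop is interleaved with decode steps from other requests, which is where the latency benefit discussed below comes from.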
Decode speed is generally memory bandwidth bound, while prefill is typically arithmetic bound. This is the reason for mixed batches (both decode and prefill) - it lets you saturate both memory bandwidth and compute.
Chunked prefill is about minimizing latency for the decode entries in the same batch. It's not needed if you have only one request - in that case it's fastest to just prefill in one chunk.
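A rough way to see the bandwidth-vs-compute split (my own back-of-the-envelope, not from the thread): for each weight matrix, a forward pass does about 2 FLOPs per parameter per token processed, while the weights have to be streamed from HBM once per pass regardless of how many tokens are in it.

```python
# Rough arithmetic-intensity estimate for a single weight matrix in bf16 (2 bytes/param).
# Back-of-the-envelope numbers, not measurements.
def flops_per_byte(tokens_per_pass, bytes_per_param=2):
    # ~2 FLOPs (multiply + add) per parameter per token; weights read once per pass
    return 2 * tokens_per_pass / bytes_per_param

print(flops_per_byte(tokens_per_pass=1))      # decode, batch of 1  -> 1 FLOP/byte
print(flops_per_byte(tokens_per_pass=32))     # decode, batch of 32 -> 32 FLOPs/byte
print(flops_per_byte(tokens_per_pass=8192))   # prefill, 8k prompt  -> 8192 FLOPs/byte
```

An H100's compute-to-bandwidth ratio is roughly a few hundred bf16 FLOPs per byte of HBM bandwidth, so decode at small batch sizes sits far below that (bandwidth bound) while prefill sits far above it (compute bound); putting both in one batch keeps both units busy.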
I'm pretty sure the sibling comment is right about the different length limits - it comes down to training, and the model talking nonsense if you let it run too long.
Chunked prefill or some similar technique is also necessary for serving long context requests where there is not enough GPU memory available, regardless of concerns about latency.
For example, consider a prompt sent to Llama 3.1 405B that uses 128k input tokens.
The KV cache will be 123GB. No matter how many GPUs you shard the model across, you are not fitting that KV cache in GPU memory (an H100 has 80GB).
You can do tensor parallelism 8 ways (8 KV heads). You can also do pipeline parallelism (there are 126 layers). Either way would work. A million tokens is possible, just very slow.
Also, 405B has 8 KV heads of size 128 (hidden_size / num_attention_heads), times 126 layers [0], times 2 (K and V), times 2 bytes (bf16), which comes to 504 KB per token. At FP8 it's 252 KB.
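Spelled out (a quick sanity check of that arithmetic, using the 405B config values quoted in the comment above):

```python
# KV cache size for Llama 3.1 405B, using the config values quoted above
num_layers    = 126
num_kv_heads  = 8     # GQA: fewer KV heads than the 128 attention heads
head_dim      = 128   # hidden_size / num_attention_heads = 16384 / 128
bytes_per_val = 2     # bf16 (use 1 for FP8)

per_token = num_layers * num_kv_heads * head_dim * 2 * bytes_per_val  # *2 for K and V
print(per_token / 1024)                 # ~504 KB per token (252 KB at FP8)

context = 128 * 1024                    # a 128k-token prompt
print(per_token * context / 1024**3)    # ~63 GiB for the whole prompt in bf16
```

By this arithmetic a 128k-token prompt needs roughly 63 GiB of KV cache in bf16, or about 8 GiB per GPU when attention is sharded across the 8 KV heads.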
It is also a training issue. The model has to be trained to reinforce longer outputs, which has a quadratic train-time cost and requires suitable long-context response training data.
They definitely have to be trained to reinforce longer outputs, but I do not believe this adequately explains the low-ish generation limits.
We are starting to see models with longer and longer generation limits (gpt-4o-mini at 16k, the o1 models going up to 64k), as well as longer and longer context limits (often 128k, with Google offering a million).
I find it very unlikely they are actually training with inputs or outputs near these maximums.
If you want to convince yourself, do the attention calculation math for these sequence lengths.
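As a back-of-the-envelope version of that exercise (my numbers, counting only the two attention matmuls for a 405B-sized config, ignoring projections):

```python
# Rough attention FLOPs per forward pass (QK^T and scores @ V only),
# for a Llama-3.1-405B-sized config: 126 layers, 128 heads of dim 128.
layers, heads, head_dim = 126, 128, 128

def attn_flops(seq_len):
    # two matmuls, each ~2 * seq_len^2 * head_dim FLOPs, per head per layer
    return 2 * 2 * seq_len**2 * head_dim * heads * layers

for seq_len in (4_096, 65_536, 131_072):
    print(f"{seq_len:>7} tokens: {attn_flops(seq_len):.2e} FLOPs")
# 4k -> ~1.4e14, 64k -> ~3.5e16, 128k -> ~1.4e17  (quadratic in sequence length)
```

Training multiplies this by roughly 3x for the backward pass, per long sequence in the batch, on top of the activation memory needed at those lengths.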
You can also see how OpenAI restricts the sequence length for fine-tuning to 64k - almost certainly bound by available GPU memory.
I suspect the 4096 limits have been set as a "reasonable" default for a myriad of reasons.