But that doesn't make sense? Why would they keep the cache persistent in the VRAM of the GPU nodes, which are needed for model weights? Shouldn't they be able to swap in/out the kvcache of your prompt when you actually use it?
Your intuition is correct and the sibling comments are wrong. Modern LLM inference servers support hierarchical caches (where data moves to slower storage tiers), often with pluggable backends. A popular open-source backend for the "slow" tier is Mooncake: https://github.com/kvcache-ai/Mooncake
OK that's pretty fascinating, turns out Mooncake includes a trick that can populate GPU VRAM directly from NVMe SSD without it having to go through the host's regular CPU and RAM first!
> Transfer Engine also leverages the NVMeof protocol to support direct data transfer from files on NVMe to DRAM/VRAM via PCIe, without going through the CPU and achieving zero-copy.