In my experience (with GPT-4 at least), a temperature of 0 does not result in deterministic output. It makes outputs more consistent, but they still vary for the same input. In practice, temperature behaves less like a determinism switch and more like a dial for "how creative should the model be?"
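It's easy to check for yourself. Here's a minimal sketch (assuming the `openai` Python SDK and an API key in the environment; the model name is just illustrative) that sends the same prompt several times at temperature 0 and counts the distinct completions:

```python
# Probe whether temperature=0 actually yields identical outputs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; any GPT-4-class model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=200,
    )
    return response.choices[0].message.content

outputs = {sample("Explain what temperature does in an LLM.") for _ in range(5)}
print(f"{len(outputs)} distinct output(s) across 5 identical requests")
# A fully deterministic backend would print 1; in my experience you
# often get more than 1.
```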
One theory is that this is caused by GPT-4's Sparse MoE (Mixture of Experts) architecture [1]:
> The GPT-4 API is hosted with a backend that does batched inference. Although some of the randomness may be explained by other factors, the vast majority of non-determinism in the API is explainable by its Sparse MoE architecture failing to enforce per-sequence determinism.
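The intuition is that sparse MoE layers route each token to a small set of experts, and each expert has a fixed capacity per batch. When a batch overflows an expert's capacity, some tokens get bumped, so which expert handles *your* token can depend on whoever else happens to share the batch. Here's a toy sketch in plain NumPy (not GPT-4's actual code; the router weights and capacity numbers are made up) showing top-1 routing with a capacity limit:

```python
# Toy illustration: capacity-limited expert routing makes a token's
# assignment depend on the other tokens co-batched with it.
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, CAPACITY, DIM = 4, 2, 8
router = rng.normal(size=(DIM, NUM_EXPERTS))  # hypothetical router weights

def assign_experts(tokens: np.ndarray) -> np.ndarray:
    """Top-1 routing; once an expert is full, later tokens fall back to -1."""
    preferred = (tokens @ router).argmax(axis=1)
    load = np.zeros(NUM_EXPERTS, dtype=int)
    assigned = np.full(len(tokens), -1)
    for i, e in enumerate(preferred):
        if load[e] < CAPACITY:
            assigned[i] = e  # token i claims a capacity slot
            load[e] += 1
    return assigned

my_tokens = rng.normal(size=(3, DIM))  # "my" request, fixed
other_a = rng.normal(size=(5, DIM))    # one possible co-batched request
other_b = rng.normal(size=(5, DIM))    # a different co-batched request

# Same tokens, same weights -- only the rest of the batch changes.
a = assign_experts(np.vstack([other_a, my_tokens]))[-3:]
b = assign_experts(np.vstack([other_b, my_tokens]))[-3:]
print(a, b, "identical" if np.array_equal(a, b) else "diverged")
```

If the assignments diverge, the same input goes through different experts and the logits shift slightly, which is enough to flip a greedy (temperature 0) decoding step.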