No, you can have latency that is independent of compute performance. The CPU/GPU may already have other tasks in flight, so new work has to wait for existing threads to finish, for clocks to ramp up, for data to arrive over slower memory paths, and so on.
If you and I have the same calculator but I'm already working through a set of problems and you're not, and we're both asked to do some math, it may take me longer to return an answer, even though the instantaneous performance of the math is the same.
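To make that distinction concrete, here's a minimal sketch (my own illustration, not part of the original point): the task's compute time is fixed, but its observed latency grows once it has to queue behind work that's already running.

```cpp
// Same computation, different latency: the "math" costs the same either way,
// but the second request waits its turn behind work already in flight.
#include <chrono>
#include <cstdio>

using clk = std::chrono::steady_clock;

// The "math" itself: a fixed amount of work, independent of system load.
static long compute() {
    volatile long acc = 0;   // volatile keeps the loop from being optimized away
    for (long i = 0; i < 50000000; ++i) acc += i % 7;
    return acc;
}

static double ms_since(clk::time_point t) {
    return std::chrono::duration<double, std::milli>(clk::now() - t).count();
}

int main() {
    // Idle case: the request is served immediately, latency ~= compute time.
    auto t0 = clk::now();
    compute();
    std::printf("idle latency:   %.1f ms\n", ms_since(t0));

    // Busy case: the identical request arrives while the only worker is still
    // finishing someone else's job, so its observed latency roughly doubles
    // even though the calculation itself is unchanged.
    auto t1 = clk::now();
    compute();  // pre-existing work already on the worker
    compute();  // our request, forced to wait its turn
    std::printf("queued latency: %.1f ms\n", ms_since(t1));
    return 0;
}
```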
The GPU is stateful and requires loading shaders and initializing pipelines before doing any work. That is where its latency comes from. It is also extremely power hungry.
The CPU has essentially zero latency to get started, but the work itself takes longer because the CPU isn't specialized for any one task and isn't massively parallel.
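A rough way to see the GPU's setup cost, using CUDA as the example API (timings are only indicative and depend on driver and hardware): the first runtime call pays for context creation, the first kernel launch still pays for module/pipeline setup, and only warm launches reflect the GPU's actual compute speed.

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

static double ms(std::chrono::steady_clock::time_point a,
                 std::chrono::steady_clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
}

int main() {
    using clk = std::chrono::steady_clock;

    // One-time cost: driver/context initialization (the GPU's "stateful" part).
    auto t0 = clk::now();
    cudaFree(0);                       // forces lazy context creation
    auto t1 = clk::now();

    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // First launch of the kernel: module/pipeline setup is still on the path.
    auto t2 = clk::now();
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    auto t3 = clk::now();

    // Warm launch: the expensive setup is already done.
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    auto t4 = clk::now();

    std::printf("context init: %.2f ms, first launch: %.2f ms, warm launch: %.2f ms\n",
                ms(t0, t1), ms(t2, t3), ms(t3, t4));
    cudaFree(d);
    return 0;
}
```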
The NPU often exposes a simpler command set where complex operations like matrix multiplication are implemented directly in hardware, rather than having to instantiate a generic compute kernel the way a GPU does.
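As a purely hypothetical sketch of that contrast (none of these names correspond to a real driver API), the NPU side can be little more than a fixed command descriptor handed to the hardware, while the GPU side needs a generic kernel compiled, bound to a pipeline, and dispatched before any math happens.

```cpp
#include <cstdint>

// Hypothetical fixed-function command: "multiply an MxK matrix by a KxN
// matrix" is a single opcode the silicon understands natively.
struct NpuMatmulCmd {
    uint32_t opcode;        // e.g. OP_MATMUL, implemented in hardware
    uint32_t m, k, n;       // problem dimensions
    uint64_t a, b, out;     // DMA addresses of the operand buffers
};

// The equivalent GPU path, in outline:
//   1. compile or load a general-purpose matmul compute shader
//   2. create a pipeline and bind buffers/descriptors
//   3. dispatch a grid of thread groups covering the output matrix
// Flexible, but there is more machinery between "I want a matmul" and the
// first multiply actually happening.

int main() {
    // Filling out the descriptor is the whole "programming model": no shader
    // compilation, no pipeline objects, just a command queued to the NPU.
    NpuMatmulCmd cmd{/*opcode=*/1, /*m=*/128, /*k=*/256, /*n=*/64,
                     /*a=*/0x1000, /*b=*/0x2000, /*out=*/0x3000};
    (void)cmd;  // a real driver would push this onto the NPU's command ring
    return 0;
}
```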