They definitely aren't doing the timing properly, and what you might think of as timing often isn't what gets marketed (though I'll admit the marketed numbers are often easier to compare). One example: if you're timing a GPU, have you actually considered that the operation you're measuring is asynchronous?

If you're naively doing `time.time()`, then what happens is this:

  import time

  start = time.time()        # CPU records time
  pred = model(input.cuda()) # push data (and the model, if not already there) to GPU memory and launch the computation. This returns immediately: the work is asynchronous
  end = time.time()          # CPU records time, regardless of whether pred holds a result yet
You probably aren't expecting that if you don't know systems and hardware. But Python (and really any language stack) is designed to be smart and compile down to something more optimized than what you literally wrote. There's no lock, so CPU tasks aren't blocked waiting on the GPU. You might ask: why do this? Because nobody knows what you actually want to do next. And do you want the timer library checking for accelerators (i.e. GPUs) every time it records a time? That would mess up your timer! (At best you'd need a constructor to say "enable locking for this accelerator.") So you have to do something a bit more nuanced.
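
For instance, here's the blunt fix, as a sketch under the same `model`/`input` placeholders as above: synchronize explicitly before reading the clock, so the wall-clock numbers only cover completed GPU work.

  import time
  import torch

  torch.cuda.synchronize()   # drain any GPU work already in flight
  start = time.time()
  pred = model(input.cuda())
  torch.cuda.synchronize()   # block until the forward pass actually finishes
  end = time.time()          # now end - start includes the GPU compute

The extra synchronize calls add overhead of their own, which is one reason the event timers below are the sharper tool.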

If you want to actually time GPU tasks, you should look at CUDA event timers (in PyTorch this is `torch.cuda.Event(enable_timing=True)`; I have another comment with boilerplate).
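
A minimal sketch of that pattern (again assuming the `model`/`input` placeholders; times come back in milliseconds):

  import torch

  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)

  start.record()                  # recorded on the GPU stream, alongside the work
  pred = model(input.cuda())
  end.record()
  torch.cuda.synchronize()        # wait for both events to actually occur
  print(start.elapsed_time(end))  # elapsed GPU time in ms

Because the events are recorded on the GPU's own stream, they bracket the kernels themselves rather than the CPU-side launch.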

Edit:

There are also complicated issues like memory size and shape. They definitely are not being nice to the NPU here on either count. NPUs (and GPUs!!!) want channels last: they used [1,6,1500,1500] but you'd want [1,1500,1500,6]. There's also the question of how memory is allocated (and they noted IO being an issue). 1500 is a weird number (as is 6), so they aren't doing the NPU any favors, and I wouldn't be surprised if this is a big hit considering how new these things are.
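
In PyTorch, channels last is a memory-format change rather than a reshape; a sketch with the shape from the article (the logical shape stays [1, 6, 1500, 1500], only the strides change so channels become innermost):

  import torch

  x = torch.randn(1, 6, 1500, 1500)
  x_cl = x.to(memory_format=torch.channels_last)
  print(x_cl.shape)     # torch.Size([1, 6, 1500, 1500]) -- unchanged
  print(x_cl.stride())  # (13500000, 1, 9000, 6) -- channel stride is now 1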

And here's my longer comment with more details: https://news.ycombinator.com/item?id=41864828



Important clarification: the async part is absolutely not Python specific; it comes from CUDA (for performance, indeed), and you have to use CUDA events in C++ too to time it properly.

For ONNX, the runtimes I know of are synchronous: you don't dispatch each operation individually but run the whole model at once, so there is no need for async and the wall-clock timings should be correct.
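
For example, a sketch with onnxruntime (the file name and input name here are placeholders): `session.run()` blocks until the outputs are ready, so plain wall-clock timing is valid.

  import time
  import numpy as np
  import onnxruntime as ort

  sess = ort.InferenceSession("model.onnx")
  x = np.random.rand(1, 6, 1500, 1500).astype(np.float32)

  start = time.time()
  out = sess.run(None, {"input": x})  # synchronous: returns only once outputs exist
  end = time.time()
  print(f"inference: {(end - start) * 1000:.1f} ms")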


Yes, it isn't python, it is... hardware. Not even CUDA specific. It is about memory moving around and optimization (remember, even the CPUs do speculative execution). I say a little more in the larger comment.

I'm less concerned about the CPU baseline and more concerned about the NPU timing, especially given the other issues.



