They definitely aren't doing the timing properly, and what you might think of as timing often isn't what gets marketed (though I'll admit the marketed numbers are often easier to compare). One example: if you're timing a GPU, have you actually considered that the operation you're measuring is asynchronous?

If you're naively doing `time.time()`, then what happens is this:

  import time

  start = time.time()        # CPU records time
  pred = model(input.cuda()) # push data (and the model, if not already there) to GPU memory and launch the computation. This returns immediately: the work is asynchronous
  end = time.time()          # CPU records time, regardless of whether pred holds a result yet
You probably aren't expecting that if you don't know systems and hardware. But Python (and really any language stack) is designed to be smart and compile down to something more optimized than what you literally wrote. There's no lock, so CPU tasks aren't blocked waiting on the GPU. You might ask: why do this? Because nobody knows what you actually want to do next. And do you want the timer library checking for accelerators (i.e. GPUs) every time it records a time? That would mess up your timer! (At best you'd need a constructor to say "enable locking for this accelerator.") So you have to do something a bit more nuanced.
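
For instance, here's the blunt fix, as a sketch under the same `model`/`input` placeholders as above: synchronize explicitly before reading the clock, so the wall-clock numbers only cover completed GPU work.

  import time
  import torch

  torch.cuda.synchronize()   # drain any GPU work already in flight
  start = time.time()
  pred = model(input.cuda())
  torch.cuda.synchronize()   # block until the forward pass actually finishes
  end = time.time()          # now end - start includes the GPU compute

The extra synchronize calls add overhead of their own, which is one reason the event timers below are the sharper tool.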

If you want to actually time GPU tasks, you should look at CUDA event timers (in PyTorch this is `torch.cuda.Event(enable_timing=True)`; I have another comment with boilerplate).
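
A minimal sketch of that pattern (again assuming the `model`/`input` placeholders; times come back in milliseconds):

  import torch

  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)

  start.record()                  # recorded on the GPU stream, alongside the work
  pred = model(input.cuda())
  end.record()
  torch.cuda.synchronize()        # wait for both events to actually occur
  print(start.elapsed_time(end))  # elapsed GPU time in ms

Because the events are recorded on the GPU's own stream, they bracket the kernels themselves rather than the CPU-side launch.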

Edit:

There are also complicated issues like memory size and shape. They definitely are not being nice to the NPU here on either count. NPUs (and GPUs!!!) want channels last: they used [1,6,1500,1500] but you'd want [1,1500,1500,6]. There's also the question of how memory is allocated (and they noted IO being an issue). 1500 is a weird number (as is 6), so they aren't doing the NPU any favors, and I wouldn't be surprised if this is a big hit considering how new these things are.
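
In PyTorch, channels last is a memory-format change rather than a reshape; a sketch with the shape from the article (the logical shape stays [1, 6, 1500, 1500], only the strides change so channels become innermost):

  import torch

  x = torch.randn(1, 6, 1500, 1500)
  x_cl = x.to(memory_format=torch.channels_last)
  print(x_cl.shape)     # torch.Size([1, 6, 1500, 1500]) -- unchanged
  print(x_cl.stride())  # (13500000, 1, 9000, 6) -- channel stride is now 1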

And here's my longer comment with more details: https://news.ycombinator.com/item?id=41864828



Important clarification: the async part is absolutely not Python specific; it comes from CUDA (for performance, indeed), and you have to use CUDA events in C++ too to time it properly.

For ONNX, the runtimes I know of are synchronous: you don't dispatch each operation individually but run the whole model at once, so there is no need for async and the wall-clock timings should be correct.
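
For example, a sketch with onnxruntime (the file name and input name here are placeholders): `session.run()` blocks until the outputs are ready, so plain wall-clock timing is valid.

  import time
  import numpy as np
  import onnxruntime as ort

  sess = ort.InferenceSession("model.onnx")
  x = np.random.rand(1, 6, 1500, 1500).astype(np.float32)

  start = time.time()
  out = sess.run(None, {"input": x})  # synchronous: returns only once outputs exist
  end = time.time()
  print(f"inference: {(end - start) * 1000:.1f} ms")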


Yes, it isn't python, it is... hardware. Not even CUDA specific. It is about memory moving around and optimization (remember, even the CPUs do speculative execution). I say a little more in the larger comment.

I'm less concerned about the CPU baseline and more concerned about the NPU timing, especially given the other issues.



