Also, there are "levels" of hardware acceleration: using CUDA (or any other shader-level acceleration) will always be less efficient than a dedicated hardware block.

And there are multiple steps in decoding a video. Some steps in some codecs fit a particular acceleration scheme better than others, so at a given point it may not be worth the hardware cost to do a full-pipeline decode; later, when transistors are cheaper or new HW decode techniques are discovered, more steps can be done in dedicated hardware blocks. Those hardware blocks may also have hard limits: if a block can only cope with (say) 1080p60 at a certain profile level for a codec, trying to do anything beyond that will likely skip the HW block completely. It's hard to do any kind of "hybrid" decode if it's not a whole pipeline step.
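
To make the "hard limits" point concrete, here's a rough Python sketch of the kind of all-or-nothing check a player might do. Everything in it is hypothetical: the names (StreamInfo, HwBlockLimits, pick_decode_path) and the limit values are made up for illustration and don't correspond to any real driver or API.

```python
from dataclasses import dataclass
from enum import Enum


class DecodePath(Enum):
    FIXED_FUNCTION = "dedicated hardware block"
    SHADER = "shader-level acceleration (e.g. CUDA)"
    SOFTWARE = "plain CPU decode"


@dataclass(frozen=True)
class StreamInfo:
    codec: str
    width: int
    height: int
    fps: int
    level_idc: int  # e.g. 42 for H.264 level 4.2


@dataclass(frozen=True)
class HwBlockLimits:
    """Hypothetical hard limits of one fixed-function decode block."""
    codec: str
    max_width: int
    max_height: int
    max_fps: int
    max_level_idc: int

    def can_decode(self, s: StreamInfo) -> bool:
        # All-or-nothing: exceeding any single limit rules the block out.
        return (
            s.codec == self.codec
            and s.width <= self.max_width
            and s.height <= self.max_height
            and s.fps <= self.max_fps
            and s.level_idc <= self.max_level_idc
        )


def pick_decode_path(stream, hw_blocks, shader_codecs):
    """Prefer a dedicated block, then shader-level, then software."""
    if any(block.can_decode(stream) for block in hw_blocks):
        return DecodePath.FIXED_FUNCTION
    if stream.codec in shader_codecs:
        return DecodePath.SHADER
    return DecodePath.SOFTWARE


# Illustrative (made-up) block: H.264 up to 1080p60 at level 4.2.
blocks = [HwBlockLimits("h264", 1920, 1080, 60, 42)]
full_hd = StreamInfo("h264", 1920, 1080, 60, 42)
ultra_hd = StreamInfo("h264", 3840, 2160, 60, 52)
```

With these made-up numbers, pick_decode_path(full_hd, blocks, set()) lands on the dedicated block, while ultra_hd skips it entirely and ends up on the shader path or the CPU depending on what else is available. There's no "use the block for part of the frame" option in between.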

"HW Video Decode Acceleration" isn't a simple boolean.