Everyone here is asking about hardware encoding/decoding, which is obviously important.
My question is, at what level do these encoders work? Are they basically specialized SIMD instructions, or are they fully featured chips that take raw data as input and produce byte streams in the format of the protocol? Or somewhere in between?
Typically it's a dedicated hardware block on the die of a larger chip, like AMD's UVD [1] and VCE [2] inside their GPUs/APUs, or Nvidia's PureVideo [3] and NVENC [4]. Sometimes the line is even blurrier, like the Broadcom SoC in the Raspberry Pi, where various more general-purpose processors can work together to decode [5].
Then you have APIs that try to break video codec processing into its typical stages, which applications can call and GPU drivers can implement; these serve as a bridge between hardware-assisted decode and application code. These are APIs like DXVA, or one of the several in use on Linux [6].
In between, but more like the latter conceptually. It’ll be a collection of bulk data processing blocks that are glued together by software. Generally, you want to implement the large bulk operations in HW, with SW handling all the option parsing and control logic.
And this opens up the question of how easy it is to re-target existing hardware to new techniques.
For example, if I understand Chroma-from-Luma prediction correctly, the maths involved is just 2-D linear regression (once for U vs. L, once for V vs. L). That's a pretty generic task. Even with the domain specialisation that it is done over the pixels of an encoding block, we are still talking about concepts common to all codecs.
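To make the point concrete, here's a minimal sketch of that regression (NumPy, with made-up block data — not any codec's actual CfL algorithm): fitting chroma from co-located luma over one block is just ordinary least squares.

```python
import numpy as np

# Hypothetical 8x8 block: reconstructed luma samples, and a chroma (U)
# plane faked to be linearly correlated with luma plus a little noise.
rng = np.random.default_rng(0)
luma = rng.integers(0, 256, size=(8, 8)).astype(float)
u = 0.5 * luma + 20.0 + rng.normal(0.0, 2.0, size=(8, 8))

# CfL-style prediction for this block: fit u ~= alpha * luma + beta
# by ordinary least squares, then predict chroma from luma.
alpha, beta = np.polyfit(luma.ravel(), u.ravel(), 1)
pred_u = alpha * luma + beta
```

The same fit would be repeated for V vs. L. The point isn't this particular code, it's that the bulk math is a generic dot-product/accumulate workload, exactly the kind of thing existing hardware blocks are good at.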
So maybe the existing acceleration hardware can already do it. But even if it can, the required primitive needs to be exposed to software if new protocols are to benefit from it. So my question (and maybe Boxxed's too) is whether the hardware interface is low-level enough for such adaptability.