As a user of an ARM Mac, I wonder: how much effort does it take to get such opti...

rbultje · 2025-02-23T13:40:44 1740318044

> If it's so heavy in assembly, the fact that ffmpeg works on my Mac seems like a miracle. Is it ported by hand?

Not ported, but rather re-implemented. So: yes.

A bit more detail: at build time, an x86 FFmpeg binary includes hand-written AVX2 (and SSSE3, AVX-512, etc.) implementations of CPU-intensive functions, while an Arm FFmpeg binary includes hand-written Neon implementations (plus a bunch of extensions, e.g. dotprod) instead.

At runtime (when you start the FFmpeg binary), FFmpeg "asks" the CPU what instruction sets it supports. Each component (decoder, encoder, etc.), when used, then sets up function pointers for its CPU-intensive tasks. These are first initialized to a C version and then updated to the Neon or AVX2 version, depending on what's included in the build and supported by this specific device.

So in practice, all CPU-intensive tasks for components in use will run hand-written Neon code for you, and hand-written AVX2 for me. For people on obscure devices, it will run the regular C fallback.
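
To make the function-pointer mechanism concrete, here is a minimal sketch in C. Every name in it (`DSPContext`, `dsp_init`, `detect_cpu_flags`, the `CPU_FLAG_*` values) is made up for illustration and is not FFmpeg's actual API. The "SIMD" function is plain C standing in for hand-written assembly, and the detection step uses compiler macros as a placeholder for a real runtime CPU query.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature flags; real projects define their own set. */
enum { CPU_FLAG_NEON = 1 << 0, CPU_FLAG_AVX2 = 1 << 1 };

/* Portable C fallback: dst[i] += src[i], wrapping modulo 256. */
static void add_bytes_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* Stand-in for a hand-written Neon/AVX2 routine. In a real build this
 * would live in an assembly file; here it is C with the same behavior. */
static void add_bytes_simd_standin(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* One table of function pointers per component (decoder, encoder, ...). */
typedef struct DSPContext {
    void (*add_bytes)(uint8_t *dst, const uint8_t *src, size_t n);
} DSPContext;

/* Placeholder detection: a real implementation asks the CPU or OS at
 * runtime; this sketch only looks at what the compiler targets. */
int detect_cpu_flags(void)
{
#if defined(__ARM_NEON)
    return CPU_FLAG_NEON;
#elif defined(__AVX2__)
    return CPU_FLAG_AVX2;
#else
    return 0;
#endif
}

/* Start from the C version, then override if an optimized one applies. */
void dsp_init(DSPContext *c, int cpu_flags)
{
    assert(c != NULL);
    c->add_bytes = add_bytes_c;
    if (cpu_flags & (CPU_FLAG_NEON | CPU_FLAG_AVX2))
        c->add_bytes = add_bytes_simd_standin;
}
```

A component would call `dsp_init(&ctx, detect_cpu_flags())` once and from then on only call `ctx.add_bytes(...)`, so the hot loop never needs to know which version it got.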

saagarjha · 2025-02-23T08:51:02 1740300662

While the instructions differ, every platform has some implementation of the basic operations (load, store, broadcast, etc.), perhaps with a different bit width. With those you can typically write an accelerated baseline implementation (sometimes these are autogenerated or use some sort of portable intrinsics, but usually they don't). If you want to go past that, things get more complicated and you will have specialized algorithms for what is available.
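
As a rough illustration of the "baseline from basic operations" idea, the C sketch below wraps load, store, broadcast and a saturating add behind a tiny set of helpers, then writes one kernel on top of them. The helper names (`vec_load`, `vec_store`, `vec_broadcast`, `vec_add_sat`) and the `brighten` kernel are invented for this example. The Neon and SSE2 intrinsics are the standard ones from `arm_neon.h` and `emmintrin.h`, and the last branch is a scalar fallback for everything else.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_WIDTH 16 /* bytes per vector in all three variants below */

#if defined(__ARM_NEON)
#include <arm_neon.h>
typedef uint8x16_t vec_u8;
static inline vec_u8 vec_load(const uint8_t *p)      { return vld1q_u8(p); }
static inline void   vec_store(uint8_t *p, vec_u8 v) { vst1q_u8(p, v); }
static inline vec_u8 vec_broadcast(uint8_t x)        { return vdupq_n_u8(x); }
static inline vec_u8 vec_add_sat(vec_u8 a, vec_u8 b) { return vqaddq_u8(a, b); }
#elif defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i vec_u8;
static inline vec_u8 vec_load(const uint8_t *p)      { return _mm_loadu_si128((const __m128i *)p); }
static inline void   vec_store(uint8_t *p, vec_u8 v) { _mm_storeu_si128((__m128i *)p, v); }
static inline vec_u8 vec_broadcast(uint8_t x)        { return _mm_set1_epi8((char)x); }
static inline vec_u8 vec_add_sat(vec_u8 a, vec_u8 b) { return _mm_adds_epu8(a, b); }
#else
/* Scalar fallback with the same interface, for any other target. */
typedef struct { uint8_t lane[VEC_WIDTH]; } vec_u8;
static inline vec_u8 vec_load(const uint8_t *p)
{
    vec_u8 v;
    for (int i = 0; i < VEC_WIDTH; i++) v.lane[i] = p[i];
    return v;
}
static inline void vec_store(uint8_t *p, vec_u8 v)
{
    for (int i = 0; i < VEC_WIDTH; i++) p[i] = v.lane[i];
}
static inline vec_u8 vec_broadcast(uint8_t x)
{
    vec_u8 v;
    for (int i = 0; i < VEC_WIDTH; i++) v.lane[i] = x;
    return v;
}
static inline vec_u8 vec_add_sat(vec_u8 a, vec_u8 b)
{
    vec_u8 r;
    for (int i = 0; i < VEC_WIDTH; i++) {
        unsigned s = (unsigned)a.lane[i] + b.lane[i];
        r.lane[i] = s > 255 ? 255 : (uint8_t)s;
    }
    return r;
}
#endif

/* Kernel written once against the helpers: dst[i] = min(src[i] + amount, 255). */
void brighten(uint8_t *dst, const uint8_t *src, size_t n, uint8_t amount)
{
    const vec_u8 add = vec_broadcast(amount);
    size_t i = 0;
    for (; i + VEC_WIDTH <= n; i += VEC_WIDTH)
        vec_store(dst + i, vec_add_sat(vec_load(src + i), add));
    for (; i < n; i++) { /* leftover elements that do not fill a vector */
        unsigned s = (unsigned)src[i] + amount;
        dst[i] = s > 255 ? 255 : (uint8_t)s;
    }
}
```

This only covers the baseline case. Anything that leans on platform-specific instructions (dotprod, wider registers, unusual shuffles) tends to need its own algorithm rather than another branch of the wrapper.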