> Is handwritten assembly faster than GCC/clang-written assembly?
Sometimes. The biggest wins come when you can carefully arrange a tight inner loop, especially one that can make use of SIMD, as in some DSP and scientific-computing code. Auto-vectorizers are getting better but still miss lots of cases, so a skilled asm programmer can beat the compiler there. In general, the more "spread out" the performance-critical code is (i.e. performance is not dominated by one or two tight loops), the harder it is for hand-coded asm to beat a compiler; humans are not that good at whole-program optimization on large codebases. The more cross-platform the code has to be, the worse it gets for the asm programmer as well: beating gcc's code-gen on one architecture is easier than beating it everywhere.
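As a rough illustration of a loop that auto-vectorizers tend to miss, consider a plain float reduction. GCC and clang typically keep the first version scalar unless you pass something like `-ffast-math`, because reordering float additions can change the rounding. The second version is only a sketch of what a hand-vectorizer does: it splits the sum into four independent accumulators, which is the lane layout a 4-wide SIMD add would use. The function names are made up for this example, and the two versions can differ in the last bits on general input.

```c
#include <stddef.h>

/* One accumulator, strict left-to-right order.
   Every add depends on the previous one. */
float sum_scalar(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four independent accumulators (hypothetical helper, illustration only).
   This changes the order of the additions, which is exactly the
   transformation the compiler is not allowed to make by default. */
float sum_unrolled4(const float *x, size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    /* Combine the partial sums, then handle the 0..3 leftover elements. */
    float acc = (a0 + a1) + (a2 + a3);
    for (; i < n; i++)
        acc += x[i];
    return acc;
}
```

Whether this actually beats the compiler's output depends on the target and flags, so measure before keeping it. An asm or intrinsics version would map the four accumulators onto one vector register.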