I frequently make this point so I may sound like a broken record, but I believe this is more or less a fallacy.
Again, the point isn’t that better assembly couldn’t be written. It’s that it most likely wouldn't be significantly better than the compiler because all of the suggestions are things compilers would be doing anyways. There are some cases where this isn’t true, especially when dealing with vectorization, but those are mostly just exceptions (and intrinsics often offer easier ways to do such optimizations...)
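To make that a bit more concrete, here is a minimal sketch of the kind of plain C loop that mainstream compilers will usually vectorize on their own at higher optimization levels. The function name `scale_add` and its parameters are made up for illustration; whether the loop is actually vectorized depends on the compiler, flags, and target.

```c
#include <stddef.h>

/* Hypothetical example: y[i] += alpha * x[i] for every element.
   `restrict` promises the compiler that x and y do not overlap,
   which is typically what it needs to vectorize the loop itself. */
void scale_add(float *restrict y, const float *restrict x,
               float alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

Compiling this with optimizations on and reading the generated assembly is a quick way to check what the compiler did before assuming hand-written asm would do better.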
But here’s the point that I feel is often ignored when it comes to programming language debates in general: just because you are experienced with and aware of advanced usages of the environment you’re programming in does not mean the complexity, and especially the cognitive overhead of that complexity, has disappeared. Looking at the C version, it doesn’t look especially optimized, and that kind of plain-looking code is not something you would see in assembler, at least not in my opinion. Complexity adds up over time; abstractions are the antidote to that problem.
On top of that, assembly language is obviously not portable, which IMO is even more reason to use a high level language and drop to asm only when needed; you can easily swap implementations and have a fallback for architectures that aren’t specifically optimized.
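A rough sketch of what that swapping can look like in C, assuming a GCC/Clang-style `__SSE2__` macro to detect the target. The function names (`add_f32`, `add_scalar`, `add_sse2`) are hypothetical; the intrinsics are the standard SSE ones from `<emmintrin.h>`.

```c
#include <stddef.h>
#if defined(__SSE2__)   /* assumption: GCC/Clang-style feature macro */
#include <emmintrin.h>
#endif

/* Portable baseline: works on every architecture. */
static void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

#if defined(__SSE2__)
/* x86 path: four floats per iteration, using unaligned loads and stores. */
static void add_sse2(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    /* Leftover elements (fewer than four) go through the scalar code. */
    add_scalar(dst + i, a + i, b + i, n - i);
}
#endif

/* Callers only see this; the implementation is picked at compile time. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
#if defined(__SSE2__)
    add_sse2(dst, a, b, n);
#else
    add_scalar(dst, a, b, n);
#endif
}
```

On a target without SSE2 the same source still builds and simply uses the scalar loop, which is the fallback I mean.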
If you don’t study and practice it, how can you possibly know how to write good code (in any language)? In assembly, if you’ve never been exposed to e.g. the SIMD instructions, aligned memory access, caches, branch prediction, or instruction-level parallelism, how can you expect to write performant assembly code? Experience and knowledge don’t just appear.
I’m not arguing that it’s worth it or that it’s easy to beat the compiler. I certainly am not going to bother writing assembly (maybe some intrinsics for SIMD, but certainly not raw assembly outside of embedded systems, and even there it’s usually not really worth it).
I’m simply saying that you can’t expect to be good at something unless you practice it.
But that doesn’t mean people shouldn’t try to learn. For example, somebody has to implement the optimisations in the compiler, and that person needs to have a great understanding of how to produce high-performance assembly code. Plus, learning new things is always worthwhile if you have the time.
In Assembly, even if you manage to beat the compiler, it might be a pyrrhic victory, because the win might be lost when running the same benchmark on another CPU or after a microcode update.
During the 80's and early 90's it was a different matter, because CPUs were dumb, hardware was relatively static (especially on 8- and 16-bit consumer systems), and high-level optimizers were pretty dumb given the resource constraints of those platforms.
I’m not debating whether or not it’s a worthy endeavour though. I’m only saying that you can’t expect good performance out of assembly code unless you practice writing high-performance assembly code. Most of us have a lot of experience with high-level languages, so it makes sense that we can write well-performing high-level code, but we shouldn’t expect that we can just “drop down to assembly” and get a performance boost. That also doesn’t mean it’s never possible for the people who actually do this a lot (e.g. the x264 people writing hand-crafted SSE/AVX code).
Which is only a tiny subset of all the opcodes that a modern Intel CPU is able to understand, let alone what AMD also offers.
You need profiling tools from each CPU vendor, like Intel’s VTune, to actually understand the timing of each opcode once it is broken down into micro-ops.
While you can master a specific subset, like the AVX instructions, mastering Assembly from front to back like in the old days is only realistic when writing Assembly for stuff like small PIC microcontrollers.
Trying to master a language like C++ is easier, which says a lot about what modern CPUs look like.