I don't know why the performance characteristics are so similar. I guess once you have a big fat memory subsystem on-die, everything else is diminishing returns?
I don't find that to be true. As Andrew is saying, the performance characteristics only look similar because software is rarely written to exploit modern processors. I have a 5-year-old overclocked Nehalem that can beat most current Haswells on single-threaded code of this sort, just because of its higher clock speed. But for algorithms designed around the capabilities of the newer instruction sets, the different generations really start to distinguish themselves.
Instead of the ~5% improvements between Intel generations on standard benchmarks, you can sometimes get 50% or more from architecture-specific algorithms. From Nehalem to Sandy Bridge to Haswell, the maximum per-cycle read bandwidth from L1 has gone from 16B to 32B to 64B. This means that approaches that would have been silly 5 years ago (like a 16KB lookup table from which you need to read 32B every cycle to sustain throughput) can be practical now.