> Never rely on benchmark tests unless the benchmarking code is known to be open source and compiled without using any Intel tools.
A serious question - how many common benchmark packages are compiled by ICC or use Intel MKL? I hope the number is limited, otherwise all the benchmarks published by mainstream PC reviewers are potentially biased. If there's a serious ICC-in-benchmark problem, then only Phoronix's Linux benchmarks are trustworthy - the majority of benchmarks on Phoronix use free and open source compilers and test suites, with known versions, build parameters and optimization levels. Thanks to Michael Larabel for his service to the community.
This concern really only applies to synthetic benchmarks (stuff like SPEC CPU). If you're testing a commercially available application or game as delivered to consumers, this issue does not invalidate the benchmark, it just makes the software vendor a bit of an Intel stooge.
Of course it doesn't matter for home consumers, because the real target audience for MKL isn't desktop user applications or games. Try using VASP or COMSOL or Mathematica on an AMD CPU. Both MKL and CUDA are major "issues" in HPC that limit decisions in purchasing clusters.
I don't know about VASP because it's proprietary, but I have compared other DFT codes on 64-core Bulldozer nodes against 12-core Sandybridge nodes. I don't remember the numbers, but the high core count was rather effective in reducing the communication costs, with all free software.
https://www.archer2.ac.uk/ will run a lot of that sort of thing. I think at least cp2k and CASTEP are included in the benchmarks; the full list is published somewhere.
When you're actually part of conversations on purchasing a cluster (which I gather from the way you talk, you haven't been), which costs $400k ~ $1M for a smallish/mid-sized system, arguments like "I don't remember the numbers", "they're listed somewhere" and "there's this other random DFT code" aren't effective. The hard fact (which I hate) is that you get results faster with MKL on Intel than with any of the alternatives. This is even more so with the proprietary software packages that are the gold standards.
> I have compared other DFT code on 64-core Bulldozer with 12-core Sandybridge nodes
And what's that comparison supposed to tell us, aside from the obvious fact that MPI introduces latency? That's just about the number of cores, not the performance of each core. You need to compare a 64-core AMD node against a 64-core Intel node.
I don't remember, because the measurements on the £1M purchase were made maybe five years ago, but they taught a useful lesson. I didn't see figures in what I was responding to. If I'd had more influence on the purchase, as opposed to observing the process, we wouldn't have ended up with a pure Sandybridge system, which was a mistake. Anyhow, my all-free-software build of cp2k was faster on it than an all-Intel build on slightly faster CPUs in an otherwise equivalent cluster. I measured and paid attention to the MPI, which benefited everything using alltoallv. The large core-count AMD boxes were simply a better bet for the range of work on a university HPC system. It's not as if most codes topped out on arithmetic intensity, so there wouldn't have been a serious problem with serial performance even if MKL had been significantly better than the free libraries, which it wasn't.
For a recent exercise that spent rather more money on AMD CPUs for the UK Tier 1 system, look at the Archer2 reference and benchmarking for it. It's expected to run large amounts of VASP-like code; www.archer.ac.uk publishes usage of the current system. Circumstances differ, and I'm pointing out contrary experience, with an understanding of the measurements and what determined them.
Well, it depends. If you are a big enough outlet or a hyperscaler, you should care about raw performance, because if a sufficient number of people make the same choice as you, the software vendor will adapt and you will end up with the competitive advantage.
Most Python math and ML libraries are compiled using GCC or LLVM and are linked against CBLAS or OpenBLAS[1]. The latter is highly performant on both Intel and AMD (and other platforms). Some libraries are optionally compiled against MKL, in particular those distributed with the Anaconda Python distribution.
MKL's benchmark performance requires AVX + FMA.
3.5 GHz * (4 add + 4 multiply) * 2 fma/cycle = 56 peak GFLOPS.
To exceed 50 GFLOPS without them would imply the CPU ran at 12.5 GHz.
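Spelling that out (assuming 256-bit AVX with two FMA units per cycle for the first figure, and 128-bit SSE without FMA, i.e. at most 2 adds + 2 multiplies per cycle, for the second):

$$3.5\,\mathrm{GHz} \times 2\,\mathrm{FMA/cycle} \times (4\,\mathrm{add} + 4\,\mathrm{mul}) = 56\,\mathrm{GFLOPS}$$

$$\frac{50\,\mathrm{GFLOPS}}{4\,\mathrm{flops/cycle}} = 12.5\,\mathrm{GHz}$$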
OpenBLAS, on the other hand, actually performed poorly because it was limited to SSE thanks to a bug preventing the CPU from being recognized as Zen2.
I checked the instructions with perf and it is using an SSE code path. Also, as reported elsewhere, MKL_DEBUG_CPU_TYPE=5 does not enable AVX2 support as it used to do.
The plot thickens. As I reported elsewhere in the thread, the slow code paths were selected on my machine, unless I override the mkl_serv_intel_cpu_true function to always return true. However, this was with PyTorch.
I have now also compiled the ACE DGEMM benchmark and linked against MKL iomp:
So, it is clearly using a GEMM kernel. Now I wonder what is different between PyTorch and this simple benchmark that causes PyTorch to end up on the slow SSE code path.
Found the discrepancy. I use single precision in PyTorch. When I benchmark sgemm, the SSE code path is selected.
Conclusion: MKL detects Zen now, but currently only implements a Zen code path for dgemm and not for sgemm. To get good performance for sgemm, you have to fake being an Intel CPU.
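A minimal sketch of the kind of check that shows the split (timing cblas_dgemm against cblas_sgemm on the same problem size); the matrix size, header and link line are illustrative assumptions, not the benchmark used above:

```c
/* gemm_check.c -- rough dgemm vs. sgemm timing; illustrative only.
 * Build e.g.: gcc -O2 gemm_check.c -o gemm_check -lopenblas
 * (or include <mkl.h> and link MKL instead when testing Intel's library). */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const int n = 2000;   /* arbitrary, but large enough to hit the GEMM kernels */
    double *Ad = malloc(sizeof *Ad * n * n), *Cd = calloc(n * n, sizeof *Cd);
    float  *Af = malloc(sizeof *Af * n * n), *Cf = calloc(n * n, sizeof *Cf);
    for (int i = 0; i < n * n; i++) { Ad[i] = 1.0; Af[i] = 1.0f; }

    double t0 = now();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, Ad, n, Ad, n, 0.0, Cd, n);
    double t_d = now() - t0;

    t0 = now();
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0f, Af, n, Af, n, 0.0f, Cf, n);
    double t_s = now() - t0;

    double flops = 2.0 * n * n * n;   /* multiply-adds per GEMM */
    printf("dgemm: %.3fs (%.1f GFLOPS)\nsgemm: %.3fs (%.1f GFLOPS)\n",
           t_d, flops / t_d / 1e9, t_s, flops / t_s / 1e9);
    return 0;
}
```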
FWIW, on my [Skylake/Cascadelake]-X Intel systems, Intel's compilers performed well, almost always outperforming GCC and Clang. But on Zen, their performance was terrible. So I was happy to see that MKL, unlike the compilers, did not appear to gimp AMD.
It's disappointing that MKL doesn't use optimized code paths on the 3700X.
I messaged the person who actually ran the benchmarks and owns the laptop, asking them to chime in with more information. I'm just the person who wrote that benchmark suite.
No, I don't. The 32-core AWS systems must be Epyc, so I'll try benchmarking there.
When OpenBLAS identifies the arch, it is competitive with MKL in single threaded performance, at least for matrices with a couple hundred rows and columns or more.
But MKL truly shines with multiple threads, so scaling on a 32 core system would be interesting to look at.
I'd have to disagree; whenever I install new pip/conda packages, an MKL version is downloaded by default (I've never seen any *BLAS version by default in my life).
NumPy binary wheels on PyPI (i.e. from pip) are built with OpenBLAS. NumPy from the official Anaconda, Inc. conda channel defaults to MKL. The conda-forge channel defaults to OpenBLAS.
Money quote: "... on an AMD computer then you may set the environment variable MKL_DEBUG_CPU_TYPE=5."
When run on an AMD, any program built with Intel's compiler should have the environment variable set. I don't think there is any downside to leaving it on all the time, unless you are measuring how badly Intel has tried to cripple your AMD performance.
That's definitely possible (it probably checks that the manufacturer ID is GenuineIntel), but nobody wants to distribute patched MKL versions, because it most likely violates the MKL license.
It may even be easier to replace the function altogether with LD_PRELOAD.
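For reference, the override can literally be a one-function shared library; a minimal sketch, assuming the internal symbol really is called mkl_serv_intel_cpu_true as reported above (the file name and build line are just examples):

```c
/* fakeintel.c -- build: gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 * Use:   LD_PRELOAD=/path/to/libfakeintel.so ./your_mkl_program
 * MKL consults this internal predicate to decide whether it is running on an
 * Intel CPU; preloading a version that always says "yes" makes it pick the
 * optimized code paths on AMD as well. */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}
```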
By the way, if you want to make this permanent in a binary, there is no need to set LD_PRELOAD all the time. You could just add a DT_NEEDED entry for the shim library to the dynamic section, e.g. with `patchelf --add-needed`.
I’m sure their justification is that (1) they have no obligation to help AMD, and (2) how could you guarantee AMD implements CPUID the same as Intel (as in: what if AMD implements a feature bit differently?)
Of course, the second one makes no sense, as x86 programs run just as well on AMD as on Intel with the same feature set (albeit at different speeds).
You distribute a binary patch for a given MKL release, have your package download the official MKL release and then patch it using the binary patch. Nobody suffers, everyone wins.
Exactly what I was thinking. For libs like MKL it should even be feasible to have a database of known binary releases with a patch offset so you can speed up your scientific application using a little patch tool. But even for executables my guess is that it should be relatively easy to programmatically find the relevant check and patch it, unless Intel starts to deliberately obfuscate it, like copy protection checks in games.
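As a sketch of how small such a patch tool could be, assuming you already know the file offset and replacement bytes for a given MKL build (both are hypothetical placeholders here, presumably looked up from a database keyed on the library's checksum):

```c
/* patch_offset.c -- overwrite a handful of bytes at a known offset in a binary.
 * Usage: ./patch_offset <path-to-library>
 * OFFSET and PATCH are made-up placeholders; real values would come from a
 * per-release database. The bytes shown are x86 for "mov eax, 1; ret". */
#include <stdio.h>

static const long OFFSET = 0x12345;   /* hypothetical location of the vendor check */
static const unsigned char PATCH[] = { 0xB8, 0x01, 0x00, 0x00, 0x00, 0xC3 };

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "r+b");
    if (!f) { perror("fopen"); return 1; }
    if (fseek(f, OFFSET, SEEK_SET) != 0 ||
        fwrite(PATCH, 1, sizeof PATCH, f) != sizeof PATCH) {
        perror("patch failed");
        fclose(f);
        return 1;
    }
    fclose(f);
    return 0;
}
```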
How on earth is the end user supposed to know to do that, know when to do that, or know what to do when the next update to Intel's compiler puts cpu-type-5 on the pessimal code path?
Is there something I can add to my bashrc to handle that?
If the environment variable still works, it could be set by a distribution (especially scientific ones) or in your .bashrc, e.g. `export MKL_DEBUG_CPU_TYPE=5`.
If that fails, as OP implies, you can still override the function by creating a tiny library with it always returning true. On GNU/Linux systems, you do that using LD_PRELOAD. Perhaps someone's already done that so you just need to download, compile and set it.
Sorry for the lack of specifics, but I do not deal with these libraries, yet I was still hoping to point you in the right direction.
Not any program built using ICC, rather, any program using Intel’s MKL, a set of basic linear algebra libraries (BLAS). This is typically limited to scientific computing applications and libraries.
The statement you were responding to is only referring to the Intel MKL, though. There are many other BLAS libraries. Were you making a more general statement about some set of BLAS implementations? Or the BLAS interface in general, perhaps?
I work on CFD software. We're well aware of this in my work, but the reality is that all our big corporate clients use Intel hardware. We already tell people to set those environment variables in our documentation.
> Avoid the Intel compiler. There are other compilers with similar or better performance.
This is not really true IMO, but even as an aside, the Intel compiler has the enormous advantage of being available cross-platform, so we can use it on Linux and Windows, and it provides MPI cross-platform too. We upgrade fairly regularly and that means less work for us.
My own tests found that PGI compiler performance was worse than Intel for C++, and that now appears to have been discontinued on Windows anyway with NVidia's new HPC compiler suite replacing it. GNU can run everywhere, but performance is around 2.5x worse on Linux for our application use case because it doesn't perform many of the optimisations that Intel does. We use MSVC on Windows just because everyone can have a license, and performance is much worse.
The other thing is that MKL is pretty stable and gets updated. If I use an open source BLAS/LAPACK implementation - sure, it works, and it may even give better performance! But it's not guaranteed to get updates beyond a couple of years, and plenty of implementations are also only partial. We pay Intel a lot of money for the lack of hassle, basically.
So, it has to be asked, which optimizations does the Intel compiler perform that GCC can't? I could guess at the reason for a factor of two, but what does detailed profiling say with equivalent compiler flags? I can also say that GCC is a factor of two better on SKX on one Fortran benchmark, and it came out about the same over the collection it's from when profile-directed. The usual reason for the Intel compiler appearing to win by much is incorrect-by-default maths optimization allowing more vectorization.
I don't know about MKL stability, but reliability definitely isn't something I associate with the Intel Fortran compiler (or MPI) in research computing support.
that's an incredible margin, and sounds suspiciously like they didn't enable optimizations on gcc, or set icc to optimize for a specific processor and gcc to generic, or something like that.
At which point, it's not just the compiler (where GCC is pretty good), but also the threading implementation (and I can believe that GCC has an inferior OpenMP implementation on Windows threading).
I don't really use either tool. But OpenMP + GCC on Windows doesn't sound like it'd be fast to me.
--------
MSVC only has OpenMP 2.0 support (OpenMP is all the way up to 5.0 now).
OpenMP, despite being a common interface, is also pretty reliant on implementation details for performance. One way of doing things on GCC could be faster than another, while it could be the opposite on ICC. It's quite possible that their codebase is tailored for ICC, and that recompiling it under GCC (with a different OpenMP implementation) results in weaker performance.
I wouldn't expect 250% performance difference in normal code however. GCC and ICC aren't that far off under typical circumstances.
The mythology surrounding the Intel tools and libraries really ought to die. It's bizarre seeing people deciding they must use MKL rather than the linear algebra libraries that AMD has been working hard to optimize for their hardware (and possibly other hardware incidentally). Similarly for compiler code generation.
Free BLASs are pretty much on a par with MKL, at least for large dimension level 3 in BLIS's case, even on Haswell. For small matrices MKL only became fast after libxsmm showed the way. (I don't know about libxsmm on current AMD hardware, but it's free software you can work on if necessary, like AMD have done with BLIS.) OpenBLAS and BLIS are infinitely better performing than MKL in general because they can run on all CPU architectures (and BLIS's plain C gets about 75% of the hand-written DGEMM kernel's performance).
The differences between the implementations are comparable with the noise in typical HPC jobs, even if performance was entirely dominated by, say, DGEMM (and getting close to peak floating point intensity is atypical). On the other hand, you can see a factor of several difference in MPI performance in some cases.
Not really. No one outside of specialized applications like HPC will use Intel's compiler for their software. The general public, seeing SPEC benchmark figures comparing gcc-on-AMD with icc-on-Intel, may be surprised when they find that the Intel CPU doesn't perform as well as expected vs AMD when running generic code.
5-10 years ago the Intel C compiler produced significantly faster code than gcc (and clang was even worse back then), so there was a bigger reason to use it back then.
That was the story 10 years ago as well, yet I never managed to find an open source program where the Intel compiler produced faster code than gcc back then either.
gcc has consistently produced faster code for at least 15 years. In fact, it is the Intel compiler which has caught up in the most recent version.
I got faster (10-20%) results with icc on an abstract game minimax AI bot back then (i.e. something similar to a chess engine). Even more so when taking advantage of PGO. Over time GCC caught up.
By nature, this code had no usage of floating point in its critical path.
For what sort of application? I ran benchmarks of my own scientific code for doing particle-particle calculations and with -march=native I could get 2.5x better performance with Intel vs GCC.
One thing I found that you do have to be careful with, though, is ensuring that Intel uses IEEE floating point precision, because by default it's less accurate than GCC. This sometimes causes issues in Eigen; we ran into one recently after upgrading the compiler where the results suddenly changed, and it was because someone had forgotten to set 'fp-model' to 'strict'.
If Intel is using floating point math shortcuts, you can replicate them with -Ofast when using gcc.
It goes without saying that you should use -O3 (or -O2 in some rare cases) otherwise. I mention it just in case, because 2.5x slower sounds so exotic to me that my first intuition is that you're omitting important optimization flags when using GCC. GCC was faster than Intel on everything I tried in the past.
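To make the "shortcuts" concrete, here's a small, self-contained illustration (nothing to do with the parent's code): Kahan-compensated summation relies on the compiler not reassociating floating-point operations, which is exactly what -Ofast/-ffast-math (and Intel's default fp-model) permit.

```c
/* kahan.c -- naive vs. compensated float summation.
 * With -O2 the compensated sum stays close to the true value; with -Ofast
 * (or icc's default fp-model) the compiler may reassociate the arithmetic,
 * cancel the compensation term, and the result changes. */
#include <stdio.h>

int main(void) {
    const int n = 10000000;
    float naive = 0.0f, sum = 0.0f, c = 0.0f;
    for (int i = 0; i < n; i++) {
        const float x = 0.1f;
        naive += x;
        /* Kahan compensation: recover low-order bits lost in each addition */
        float y = x - c;
        float t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    printf("naive = %f, compensated = %f, expected ~ %f\n",
           naive, sum, 0.1 * n);
    return 0;
}
```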
Once upon a time, Oracle used Intel C Compiler (ICC) to compile Oracle RDBMS on some platforms [1].
I don't know if Oracle is still using ICC for that or not. (If you download Oracle RDBMS, and check the binaries, you will be able to work it out. I can't be bothered.)
There can be various traces left in strings, the symbol table, etc
Many compilers statically link implementations of various built-in functions into the resulting executable, and that can result in different symbol table entries
...and that despite not being anywhere near as aggressive with exploiting UB as gcc or clang, which shows that backend-based optimisations like instruction selection, scheduling, and register allocation are far more valuable (and predictable).
I don't think anyone disputes that? Most optimizing compiler literature doesn't even mention language semantics; the gains there are very much last-ditch rather than necessary.
I can't even find benchmarks of ICC vs a current GCC but they were pretty even the best part of a decade ago. GCC is a mess compared to LLVM but it's quick.
I've never used Unity, but Unreal Engine is heavily tied into the Visual Studio (proper, not Code) workflow, including the Microsoft C++ compiler toolchain and all 30GB+ of its friends.
Unreal uses the native compiler for the target platform. On Windows this is MSVC. Modern consoles are all clang forks. Linux is the only exception, where I think they depend on clang, not gcc.
Most high performance software on supercomputers uses the Intel C and Fortran compilers, and much engineering and scientific software on workstations uses the Intel Math Kernel Library (MKL) for high performance linear algebra.
Now that AMD EPYC processors are powering a lot of next generation super-computer clusters, we're going to have to figure out some workarounds!
I just compiled TensorFlow on AMD Epyc and had no idea https://github.com/oneapi-src/oneDNN was actually MKL... now I'm wondering if I'm even getting all that power.
I took a reversing course some years ago and during the first part we learned how to identify the compiler using common patterns. Long story short, the Intel compiler did a phenomenally amazing job optimizing. This was 10 years ago so things may be different now.
10 years ago LLVM was a baby and GCC was still on version 4. Intel probably have an advantage in areas where people pay them for it but GCC and LLVM are excellent compilers today.
Anecdotally (and ignoring that I'm still not sure whether to trust it or not), Intel stopped developing IACA and suggested (but did not recommend) LLVM's MCA - which does suggest a changing of the guard in some way.
Edit: the link I posted follows Agner's advice from the bottom of the OP's link. However, I think the extra information it adds is that Zen2 Threadrippers outpaced Intel's then-current top contender. Once Zen3 and Intel's 11th gen become available, repeating these benchmarks would be very valuable.
Thank you! I wasn't aware of this.
But this is only a replacement for libm (i.e. basic trig and exp functions), not the matrix-orientated BLAS, LAPACK and SCALAPACK routines in which scientific codes spend >90% of their time.
I think you meant the Intel compiler? Yes. The Intel compiler consistently produces the highest performing binaries on Intel processors, often by a big margin. Intel MKL used to be the highest performing math library, and may still be so. As a result, much performance-critical software, such as scientific applications, is compiled using ICC.
This is an overstatement. ICC consistently compiles the slowest and produces the largest binaries. It also defaults to something close to -ffast-math, which may or may not be appropriate. If your app benefits from aggressive inlining and vectorization at the expense of potentially huge increases in code size, ICC is likely to do well for you. However, I've seen lots of cases where well-vectorized code is faster with GCC or Clang, including some very important cases using Intel intrinsics. (Several such cases reported to/acknowledged by Intel; some have been fixed over the years, but these observations are not uncommon.)
I have been hearing about the superiority of Intel's compiler for a couple of decades now. Back when GCC was a tiny baby compared to what it is now, and when Clang/LLVM didn't even exist.
I wonder if this Intel compiler 'superiority' is still the case today, or if this is just a meme at this point.
For matrix-manipulation-based Fortran scientific codes, ifort/MKL can give +30% compared to gfortran. It's difficult to disentangle where the speedup comes from, but certainly, as jedbrown alludes to, the Intel compilers seem to make a better go of poorly optimised / badly written code.
For C-based software, it's a much closer-run thing, and often sticking with GCC avoids weird segfaults when mixing Intel- and GCC-compiled Linux libraries.
Where do you typically see lack of inlining and vectorization with GCC? I'm curious because most times people have said GCC wouldn't vectorize code that I've been able to try, it would, at least if allowed -ffast-math a la Intel (as in BLIS now).
The HPC codes I worked on we would compile with gcc, clang, icc and whatever vendor compiler was installed (Cray, PGI, something even worse). Then we'd benchmark the resulting binaries and make a recommendation based on speed (assuming the compiled binaries gave the correct results, which would sometimes fail and trigger further debugging to find out whether we had undefined (or implementation-defined) behavior or had managed to find a compiler bug). For codes that are memory-bandwidth dominated, the results are pretty much a toss-up. For compute-bound codes Intel would often win.
You can do the same when your machine has non-Intel CPUs that are supported by a lot of compilers. If you are on POWER9 or ARM, the compiler list gets shorter. And a lot of supercomputers are starting to contain accelerators (often, but not always, Nvidia GPUs), in which case there is often only one supported compiler and you have to rewrite code until that compiler is happy and produces fast code.
It's a depressingly common choice for educational installations. Folks are trained to use ICC instead of GCC and then they keep using ICC when they leave school.
Toward the end of the article, the several lawsuits and FTC actions are discussed. The end result of them is a disclaimer on the Intel compiler that it's not optimized for non-Intel processors and that Intel can't artificially hurt AMD performance (but it apparently has no obligation to support unique AMD optimizations either).
> (but it apparently has no obligation to support unique AMD optimizations either)
It's a bit worse than that. Intel has no obligation to support optimizations that aren't unique to AMD; they're allowed to disable SIMD extensions that AMD processors declare support for, while at the same time using all of those SIMD extensions on Intel CPUs. They just have to include the disclaimer that their compiler and libraries may be doing this.
Why is it just the compiler maker's job to report that it may (read: will) underperform on AMD, and not the program developer's too? If I paid for software that performed worse on AMD because it deliberately hobbled itself (and I was not informed), I'd want a refund.
It's straight-up anti-competitive, but consumers aren't smart enough to understand that it's a problem; a consumer just sees biased benchmarks that show Intel outperforming AMD, and then chooses Intel.
> that Intel can't artificially hurt AMD performance (but it apparently has no obligation to support unique AMD optimizations either).
As far as I understand it, it's quite the opposite: the disclaimer explicitly says that it may not apply "optimizations that are not unique to Intel" to other processors. It won't select the optimal code path unless the CPU vendor ID is GenuineIntel, and it falls back to the worst path your compile settings include.
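For the curious, the vendor ID being keyed on is just the 12-character string returned by CPUID leaf 0. A minimal sketch of reading it with GCC/Clang's <cpuid.h> (this only shows what such a vendor-string check looks like; it is not Intel's actual dispatch code):

```c
/* cpuid_vendor.c -- print the CPU vendor ID, e.g. "GenuineIntel" or
 * "AuthenticAMD". A dispatcher that gates its fast paths on this string,
 * instead of on the feature bits in later CPUID leaves, is what the
 * thread is complaining about. */
#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    char vendor[13] = {0};
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;
    /* The vendor string is spread across EBX, EDX, ECX, in that order. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    printf("vendor id: %s\n", vendor);
    return 0;
}
```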