> Never rely on benchmark tests unless the benchmarking code is known to be open source and compiled without using any Intel tools.
A serious question - how many common benchmark packages are compiled by ICC or use Intel MKL? I hope the number is limited, otherwise all the benchmarks published by mainstream PC reviewers are potentially biased. If there's a serious ICC-in-benchmark problem, then only Phoronix's Linux benchmarks are trustworthy - the majority of benchmarks on Phoronix use free and open source compilers and test suites, with known versions, build parameters and optimization levels. Thanks to Michael Larabel for his service to the community.
This concern really only applies to synthetic benchmarks (stuff like SPEC CPU). If you're testing a commercially available application or game as delivered to consumers, this issue does not invalidate the benchmark, it just makes the software vendor a bit of an Intel stooge.
Of course it doesn't matter for home consumers, because the real target audience for MKL isn't desktop user applications or games. Try using VASP or COMSOL or Mathematica on an AMD CPU. Both MKL and CUDA are major "issues" in HPC that limit decisions in purchasing clusters.
I don't know about VASP because it's proprietary, but I have compared other DFT codes on 64-core Bulldozer nodes against 12-core Sandybridge nodes. I don't remember the numbers, but the high core count was rather effective in reducing the communication costs, with all free software.
https://www.archer2.ac.uk/ will run a lot of that sort of thing. I think at least cp2k and CASTEP are included in the benchmarks; the full list is published somewhere.
When you're actually part of conversations on purchasing a cluster (which I gather from the way you talk, you haven't been), which costs $400k ~ $1M for a smallish/mid-sized system, arguments like "I don't remember the numbers", "they're listed somewhere" and "there's this other random DFT code" aren't effective. The hard fact (which I hate) is that you get results faster with MKL on Intel than with any of the alternatives. This is even more so with the proprietary software packages that are the gold standards.
> I have compared other DFT code on 64-core Bulldozer with 12-core Sandybridge nodes
And what's that comparison supposed to tell us, aside from the obvious fact that MPI introduces latency? That's just about the number of cores, not the performance of each core. You need to compare a 64-core AMD node against a 64-core Intel node.
I don't remember, because the measurements on the £1M purchase were made maybe five years ago, but they taught a useful lesson. I didn't see figures in what I was responding to. If I'd had more influence on the purchase, as opposed to observing the process, we wouldn't have ended up with a pure Sandybridge system, which was a mistake. Anyhow, my all-free-software build of cp2k was faster on it than an all-Intel build on slightly faster CPUs in an otherwise equivalent cluster. I measured and paid attention to the MPI, which benefited everything using alltoallv. The large core-count AMD boxes were simply a better bet for the range of work on a university HPC system. It's not as if most codes topped out on arithmetic intensity, so there wouldn't have been a serious problem with serial performance even if MKL had been significantly better than the free libraries, which it wasn't.
For a recent exercise that spent rather more money on AMD CPUs for the UK Tier 1 system, look at the Archer2 reference and benchmarking for it. It's expected to run large amounts of VASP-like code; www.archer.ac.uk publishes usage of the current system. Circumstances differ, and I'm pointing out contrary experience, with an understanding of the measurements and what determined them.
Well, it depends. If you are a big enough outlet or a hyperscaler, you should care about raw performance, because if a sufficient number of people make the same choice as you, the software vendor will adapt and you will end up with the competitive advantage.
Most Python math and ML libraries are compiled using GCC or LLVM and are linked against CBLAS or OpenBLAS[1]. The latter is highly performant on both Intel and AMD (and other platforms). Some libraries are optionally compiled against MKL, in particular those distributed with the Anaconda Python distribution.
MKL's benchmark performance requires AVX + FMA.
3.5 GHz * (4 add + 4 multiply) * 2 fma/cycle = 56 peak GFLOPS.
To exceed 50 GFLOPS without them would imply the CPU ran at 12.5 GHz.
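Spelling that out (assuming 256-bit AVX with two FMA units per cycle for the first figure, and 128-bit SSE without FMA, i.e. at most 2 adds + 2 multiplies per cycle, for the second):

$$3.5\,\mathrm{GHz} \times 2\,\mathrm{FMA/cycle} \times (4\,\mathrm{add} + 4\,\mathrm{mul}) = 56\,\mathrm{GFLOPS}$$

$$\frac{50\,\mathrm{GFLOPS}}{4\,\mathrm{flops/cycle}} = 12.5\,\mathrm{GHz}$$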
OpenBLAS, on the other hand, actually performed poorly because it was limited to SSE thanks to a bug preventing the CPU from being recognized as Zen2.
I checked the instructions with perf and it is using an SSE code path. Also, as reported elsewhere, MKL_DEBUG_CPU_TYPE=5 does not enable AVX2 support as it used to do.
The plot thickens. As I reported elsewhere in the thread, the slow code paths were selected on my machine, unless I override the mkl_serv_intel_cpu_true function to always return true. However, this was with PyTorch.
I have now also compiled the ACE DGEMM benchmark and linked against MKL iomp:
So, it is clearly using a GEMM kernel. Now I wonder what is different between PyTorch and this simple benchmark that causes PyTorch to end up on the slow SSE code path.
Found the discrepancy. I use single precision in PyTorch. When I benchmark sgemm, the SSE code path is selected.
Conclusion: MKL detects Zen now, but currently only implements a Zen code path for dgemm and not for sgemm. To get good performance for sgemm, you have to fake being an Intel CPU.
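A minimal sketch of the kind of check that shows the split (timing cblas_dgemm against cblas_sgemm on the same problem size); the matrix size, header and link line are illustrative assumptions, not the benchmark used above:

```c
/* gemm_check.c -- rough dgemm vs. sgemm timing; illustrative only.
 * Build e.g.: gcc -O2 gemm_check.c -o gemm_check -lopenblas
 * (or include <mkl.h> and link MKL instead when testing Intel's library). */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const int n = 2000;   /* arbitrary, but large enough to hit the GEMM kernels */
    double *Ad = malloc(sizeof *Ad * n * n), *Cd = calloc(n * n, sizeof *Cd);
    float  *Af = malloc(sizeof *Af * n * n), *Cf = calloc(n * n, sizeof *Cf);
    for (int i = 0; i < n * n; i++) { Ad[i] = 1.0; Af[i] = 1.0f; }

    double t0 = now();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, Ad, n, Ad, n, 0.0, Cd, n);
    double t_d = now() - t0;

    t0 = now();
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0f, Af, n, Af, n, 0.0f, Cf, n);
    double t_s = now() - t0;

    double flops = 2.0 * n * n * n;   /* multiply-adds per GEMM */
    printf("dgemm: %.3fs (%.1f GFLOPS)\nsgemm: %.3fs (%.1f GFLOPS)\n",
           t_d, flops / t_d / 1e9, t_s, flops / t_s / 1e9);
    return 0;
}
```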
FWIW, on my [Skylake/Cascadelake]-X Intel systems, Intel's compilers performed well, almost always outperforming GCC and Clang. But on Zen, their performance was terrible. So I was happy to see that MKL, unlike the compilers, did not appear to gimp AMD.
It's disappointing that MKL doesn't use optimized code paths on the 3700X.
I messaged the person who actually ran the benchmarks and owns the laptop, asking them to chime in with more information. I'm just the person who wrote that benchmark suite.
No, I don't. The 32-core AWS systems must be Epyc, so I'll try benchmarking there.
When OpenBLAS identifies the arch, it is competitive with MKL in single threaded performance, at least for matrices with a couple hundred rows and columns or more.
But MKL truly shines with multiple threads, so scaling on a 32 core system would be interesting to look at.
I'd have to disagree; whenever I install new pip/conda packages, an MKL version is downloaded by default (I've never seen any *BLAS version by default in my life).
NumPy binary wheels on PyPI (i.e. from pip) are built with OpenBLAS. NumPy from the official Anaconda, Inc. conda channel defaults to MKL. The conda-forge channel defaults to OpenBLAS.
Money quote: "... on an AMD computer then you may set the environment variable MKL_DEBUG_CPU_TYPE=5."
When run on an AMD, any program built with Intel's compiler should have the environment variable set. I don't think there is any downside to leaving it on all the time, unless you are measuring how badly Intel has tried to cripple your AMD performance.
That's definitely possible (it probably checks that the manufacturer ID is GenuineIntel), but nobody wants to distribute patched MKL versions, because it most likely violates the MKL license.
It may even be easier to replace the function altogether with LD_PRELOAD.
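For reference, the override can literally be a one-function shared library; a minimal sketch, assuming the internal symbol really is called mkl_serv_intel_cpu_true as reported above (the file name and build line are just examples):

```c
/* fakeintel.c -- build: gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 * Use:   LD_PRELOAD=/path/to/libfakeintel.so ./your_mkl_program
 * MKL consults this internal predicate to decide whether it is running on an
 * Intel CPU; preloading a version that always says "yes" makes it pick the
 * optimized code paths on AMD as well. */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}
```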
By the way, if you want to make this permanent in a binary, there is no need to set LD_PRELOAD all the time. You could just add a DT_NEEDED entry for the shim library to the dynamic section, e.g. with `patchelf --add-needed`.
I’m sure their justification is that (1) they have no obligation to help AMD, and (2) how could you guarantee AMD implements CPUID the same as Intel (as in: what if AMD implements a feature bit differently?)
Of course, the second one makes no sense, as x86 programs run just as well on AMD as on Intel with the same feature set (albeit at different speeds).
You distribute a binary patch for a given MKL release, have your package download the official MKL release and then patch it using the binary patch. Nobody suffers, everyone wins.
Exactly what I was thinking. For libs like MKL it should even be feasible to have a database of known binary releases with a patch offset so you can speed up your scientific application using a little patch tool. But even for executables my guess is that it should be relatively easy to programmatically find the relevant check and patch it, unless Intel starts to deliberately obfuscate it, like copy protection checks in games.
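As a sketch of how small such a patch tool could be, assuming you already know the file offset and replacement bytes for a given MKL build (both are hypothetical placeholders here, presumably looked up from a database keyed on the library's checksum):

```c
/* patch_offset.c -- overwrite a handful of bytes at a known offset in a binary.
 * Usage: ./patch_offset <path-to-library>
 * OFFSET and PATCH are made-up placeholders; real values would come from a
 * per-release database. The bytes shown are x86 for "mov eax, 1; ret". */
#include <stdio.h>

static const long OFFSET = 0x12345;   /* hypothetical location of the vendor check */
static const unsigned char PATCH[] = { 0xB8, 0x01, 0x00, 0x00, 0x00, 0xC3 };

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "r+b");
    if (!f) { perror("fopen"); return 1; }
    if (fseek(f, OFFSET, SEEK_SET) != 0 ||
        fwrite(PATCH, 1, sizeof PATCH, f) != sizeof PATCH) {
        perror("patch failed");
        fclose(f);
        return 1;
    }
    fclose(f);
    return 0;
}
```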
How on earth is the end user supposed to know to do that, know when to do that, or know what to do when the next update to Intel's compiler puts cpu-type-5 on the pessimal code path?
Is there something I can add to my bashrc to handle that?
If the environment variable still works, it could be set by a distribution (especially scientific ones) or in your .bashrc, e.g. `export MKL_DEBUG_CPU_TYPE=5`.
If that fails, as OP implies, you can still override the function by creating a tiny library with it always returning true. On GNU/Linux systems, you do that using LD_PRELOAD. Perhaps someone's already done that so you just need to download, compile and set it.
Sorry for the lack of specifics, but I do not deal with these libraries, yet I was still hoping to point you in the right direction.
Not any program built using ICC, rather, any program using Intel’s MKL, a set of basic linear algebra libraries (BLAS). This is typically limited to scientific computing applications and libraries.
The statement you were responding to is only referring to the Intel MKL, though. There are many other BLAS libraries. Were you making a more general statement about some set of BLAS implementations? Or the BLAS interface in general, perhaps?
I work on CFD software. We're well aware of this in my work, but the reality is that all our big corporate clients use Intel hardware. We already tell people to set those environment variables in our documentation.
> Avoid the Intel compiler. There are other compilers with similar or better performance.
This is not really true IMO, but even as an aside, the Intel compiler has the enormous advantage of being available cross-platform, so we can use it on Linux and Windows, and it provides MPI cross-platform too. We upgrade fairly regularly and that means less work for us.
My own tests found that PGI compiler performance was worse than Intel for C++, and that now appears to have been discontinued on Windows anyway with NVidia's new HPC compiler suite replacing it. GNU can run everywhere, but performance is around 2.5x worse on Linux for our application use case because it doesn't perform many of the optimisations that Intel does. We use MSVC on Windows just because everyone can have a license, and performance is much worse.
The other thing is that MKL is pretty stable and gets updated. If I use an open source BLAS/LAPACK implementation - sure, it works, and it may even give better performance! But it's not guaranteed to get updates beyond a couple of years, and plenty of implementations are also only partial. We pay Intel a lot of money for the lack of hassle, basically.
So, it has to be asked, which optimizations does the Intel compiler perform that GCC can't? I could guess at the reason for a factor of two, but what does detailed profiling say with equivalent compiler flags? I can also say that GCC is a factor of two better on SKX on one Fortran benchmark, and it came out about the same over the collection it's from when profile-directed. The usual reason for the Intel compiler appearing to win by much is incorrect-by-default maths optimization allowing more vectorization.
I don't know about MKL stability, but reliability definitely isn't something I associate with the Intel Fortran compiler (or MPI) in research computing support.
that's an incredible margin, and sounds suspiciously like they didn't enable optimizations on gcc, or set icc to optimize for a specific processor and gcc to generic, or something like that.
At which point, it's not just the compiler (where GCC is pretty good), but also the threading implementation (and I can believe that GCC has an inferior OpenMP implementation on Windows threading).
I don't really use either tool. But OpenMP + GCC on Windows doesn't sound like it'd be fast to me.
--------
MSVC only has OpenMP 2.0 support (OpenMP is all the way up to 5.0 now).
OpenMP, despite being a common interface, is also pretty reliant on implementation details for performance. One way of doing things on GCC could be faster than another, while it could be the opposite on ICC. It's quite possible that their codebase is tailored for ICC, and that recompiling it under GCC (with a different OpenMP implementation) results in weaker performance.
I wouldn't expect 250% performance difference in normal code however. GCC and ICC aren't that far off under typical circumstances.
The mythology surrounding the Intel tools and libraries really ought to die. It's bizarre seeing people deciding they must use MKL rather than the linear algebra libraries that AMD has been working hard to optimize for their hardware (and possibly other hardware incidentally). Similarly for compiler code generation.
Free BLASs are pretty much on a par with MKL, at least for large dimension level 3 in BLIS's case, even on Haswell. For small matrices MKL only became fast after libxsmm showed the way. (I don't know about libxsmm on current AMD hardware, but it's free software you can work on if necessary, like AMD have done with BLIS.) OpenBLAS and BLIS are infinitely better performing than MKL in general because they can run on all CPU architectures (and BLIS's plain C gets about 75% of the hand-written DGEMM kernel's performance).
The differences between the implementations are comparable with the noise in typical HPC jobs, even if performance was entirely dominated by, say, DGEMM (and getting close to peak floating point intensity is atypical). On the other hand, you can see a factor of several difference in MPI performance in some cases.
Not really. No one outside of specialized applications like HPC will use Intel's compiler for their software. The general public, seeing SPEC benchmark figures comparing gcc-on-AMD with icc-on-Intel, may be surprised when they find that the Intel CPU doesn't perform as well as expected vs AMD when running generic code.
5-10 years ago the Intel C compiler produced significantly faster code than gcc (and clang was even worse back then), so there was a bigger reason to use it back then.
That was the story 10 years ago as well, yet I never managed to find an open source program where the Intel compiler produced faster code than gcc back then either.
gcc has consistently produced faster code for at least 15 years. In fact, it is the Intel compiler which has caught up in the most recent version.
I got faster (10-20%) results with icc on an abstract game minimax AI bot back then (i.e. something similar to a chess engine). Even more so when taking advantage of PGO. Over time GCC caught up.
By nature, this code had no usage of floating point in its critical path.
For what sort of application? I ran benchmarks of my own scientific code for doing particle-particle calculations and with -march=native I could get 2.5x better performance with Intel vs GCC.
One thing I found that you do have to be careful with, though, is ensuring that Intel uses IEEE floating point precision, because by default it's less accurate than GCC. This sometimes causes issues in Eigen; we ran into one recently after upgrading the compiler where the results suddenly changed, and it was because someone had forgotten to set 'fp-model' to 'strict'.
If Intel is using floating point math shortcuts, you can replicate them with -Ofast when using gcc.
It goes without saying that you should use -O3 (or -O2 in some rare cases) otherwise. I mention it just in case, because 2.5x slower sounds so exotic to me that my first intuition is that you're omitting important optimization flags when using GCC. GCC was faster than Intel on everything I tried in the past.
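To make the "shortcuts" concrete, here's a small, self-contained illustration (nothing to do with the parent's code): Kahan-compensated summation relies on the compiler not reassociating floating-point operations, which is exactly what -Ofast/-ffast-math (and Intel's default fp-model) permit.

```c
/* kahan.c -- naive vs. compensated float summation.
 * With -O2 the compensated sum stays close to the true value; with -Ofast
 * (or icc's default fp-model) the compiler may reassociate the arithmetic,
 * cancel the compensation term, and the result changes. */
#include <stdio.h>

int main(void) {
    const int n = 10000000;
    float naive = 0.0f, sum = 0.0f, c = 0.0f;
    for (int i = 0; i < n; i++) {
        const float x = 0.1f;
        naive += x;
        /* Kahan compensation: recover low-order bits lost in each addition */
        float y = x - c;
        float t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    printf("naive = %f, compensated = %f, expected ~ %f\n",
           naive, sum, 0.1 * n);
    return 0;
}
```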
Once upon a time, Oracle used Intel C Compiler (ICC) to compile Oracle RDBMS on some platforms [1].
I don't know if Oracle is still using ICC for that or not. (If you download Oracle RDBMS, and check the binaries, you will be able to work it out. I can't be bothered.)
There can be various traces left in strings, the symbol table, etc
Many compilers statically link implementations of various built-in functions into the resulting executable, and that can result in different symbol table entries
...and that despite not being anywhere near as aggressive with exploiting UB as gcc or clang, which shows that backend-based optimisations like instruction selection, scheduling, and register allocation are far more valuable (and predictable).
I don't think anyone disputes that? Most optimizing compiler literature doesn't even mention language semantics; the gains there are very much last-ditch rather than necessary.
I can't even find benchmarks of ICC vs a current GCC but they were pretty even the best part of a decade ago. GCC is a mess compared to LLVM but it's quick.
I've never used Unity, but Unreal Engine is heavily tied into the Visual Studio (proper, not Code) workflow, including the Microsoft C++ compiler toolchain and all 30GB+ of its friends.
Unreal uses the native compiler for the target platform. On Windows this is MSVC. Modern consoles are all clang forks. Linux is the only exception, where I think they depend on clang, not gcc.
Most high performance software on supercomputers uses the Intel C and Fortran compilers, and much engineering and scientific software on workstations uses the Intel Math Kernel Library (MKL) for high performance linear algebra.
Now that AMD EPYC processors are powering a lot of next generation super-computer clusters, we're going to have to figure out some workarounds!
I just compiled TensorFlow on AMD Epyc and had no idea https://github.com/oneapi-src/oneDNN was actually MKL... now I'm wondering if I'm even getting all that power.
I took a reversing course some years ago and during the first part we learned how to identify the compiler using common patterns. Long story short, the Intel compiler did a phenomenally amazing job optimizing. This was 10 years ago so things may be different now.
10 years ago LLVM was a baby and GCC was still on version 4. Intel probably have an advantage in areas where people pay them for it but GCC and LLVM are excellent compilers today.
Anecdotally (and ignoring that I'm still not sure whether to trust it or not), Intel stopped developing IACA and suggested (but did not recommend) LLVM's MCA - which does suggest a changing of the guard in some way.
Edit: the link I posted follows Agner's advice from the bottom of the OP's link. However, I think the extra information it adds is that Zen2 Threadrippers outpaced Intel's then-current top contender. Once Zen3 and Intel's 11th gen become available, repeating these benchmarks would be very valuable.
Thank you! I wasn't aware of this.
But this is only a replacement for libm (i.e. basic trig and exp functions), not the matrix-orientated BLAS, LAPACK and SCALAPACK routines in which scientific codes spend >90% of their time.
I think you meant the Intel compiler? Yes. The Intel compiler consistently produces the highest performing binaries on Intel processors, often by a big margin. Intel MKL used to be the highest performing math library, and may still be so. As a result, much performance-critical software, such as scientific applications, is compiled using ICC.
This is an overstatement. ICC consistently compiles the slowest and produces the largest binaries. It also defaults to something close to -ffast-math, which may or may not be appropriate. If your app benefits from aggressive inlining and vectorization at the expense of potentially huge increases in code size, ICC is likely to do well for you. However, I've seen lots of cases where well-vectorized code is faster with GCC or Clang, including some very important cases using Intel intrinsics. (Several such cases reported to/acknowledged by Intel; some have been fixed over the years, but these observations are not uncommon.)
I have been hearing about the superiority of Intel's compiler for a couple of decades now. Back when GCC was a tiny baby compared to what it is now, and when Clang/LLVM didn't even exist.
I wonder if this Intel compiler 'superiority' is still the case today, or if this is just a meme at this point.
For matrix-manipulation-based Fortran scientific codes, ifort/MKL can give +30% compared to gfortran. It's difficult to disentangle where the speedup comes from, but certainly, as jedbrown alludes to, the Intel compilers seem to make a better go of poorly optimised / badly written code.
For C-based software, it's a much closer-run thing, and often sticking with GCC avoids weird segfaults when mixing Intel- and GCC-compiled Linux libraries.
Where do you typically see lack of inlining and vectorization with GCC? I'm curious because most times people have said GCC wouldn't vectorize code that I've been able to try, it would, at least if allowed -ffast-math a la Intel (as in BLIS now).
The HPC codes I worked on we would compile with gcc, clang, icc and whatever vendor compiler was installed (Cray, PGI, something even worse). Then we'd benchmark the resulting binaries and make a recommendation based on speed (assuming the compiled binaries gave the correct results, which would sometimes fail and trigger further debugging to find out whether we had undefined (or implementation-defined) behavior or had managed to find a compiler bug). For codes that are memory-bandwidth dominated, the results are pretty much a toss-up. For compute-bound codes Intel would often win.
You can do the same when your machine has non-Intel CPUs that are supported by a lot of compilers. If you are on POWER9 or ARM, the compiler list gets shorter. And a lot of supercomputers are starting to contain accelerators (often, but not always, Nvidia GPUs), in which case there is often only one supported compiler and you have to rewrite code until that compiler is happy and produces fast code.
It's a depressingly common choice for educational installations. Folks are trained to use ICC instead of GCC and then they keep using ICC when they leave school.
Toward the end of the article, the several lawsuits and FTC actions are discussed. The end result of them is a disclaimer on the Intel compiler that it's not optimized for non-Intel processors and that Intel can't artificially hurt AMD performance (but it apparently has no obligation to support unique AMD optimizations either).
> (but it apparently has no obligation to support unique AMD optimizations either)
It's a bit worse than that. Intel has no obligation to support optimizations that aren't unique to AMD; they're allowed to disable SIMD extensions that AMD processors declare support for, while at the same time using all of those SIMD extensions on Intel CPUs. They just have to include the disclaimer that their compiler and libraries may be doing this.
Why is it just the compiler maker's job to report that it may (read: will) underperform on AMD, and not the program developer's too? If I paid for software that performed worse on AMD because it deliberately hobbled itself (and I was not informed), I'd want a refund.
It's straight-up anti-competitive, but consumers aren't smart enough to understand that it's a problem; a consumer just sees biased benchmarks that show Intel outperforming AMD, and then chooses Intel.
> that Intel can't artificially hurt AMD performance (but it apparently has no obligation to support unique AMD optimizations either).
As far as I understand it, it's quite the opposite: the disclaimer explicitly says that it may not apply "optimizations that are not unique to Intel" to other processors. It won't select the optimal code path unless the CPU vendor ID is GenuineIntel, and it falls back to the worst path your compile settings include.
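For the curious, the vendor ID being keyed on is just the 12-character string returned by CPUID leaf 0. A minimal sketch of reading it with GCC/Clang's <cpuid.h> (this only shows what such a vendor-string check looks like; it is not Intel's actual dispatch code):

```c
/* cpuid_vendor.c -- print the CPU vendor ID, e.g. "GenuineIntel" or
 * "AuthenticAMD". A dispatcher that gates its fast paths on this string,
 * instead of on the feature bits in later CPUID leaves, is what the
 * thread is complaining about. */
#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    char vendor[13] = {0};
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;
    /* The vendor string is spread across EBX, EDX, ECX, in that order. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    printf("vendor id: %s\n", vendor);
    return 0;
}
```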