> GPU-oriented languages like OpenCL support SIMD but lack capabilities needed to achieve maximum efficiency on CPUs and suffer from GPU-driven constraints that impair ease of use on CPUs.
IMO these constraints are overstated and OpenCL offers a good abstraction for CPUs too.
This used to be a nearly universally held opinion, but recently I've begun to see fairly optimistic results reported in the literature (see [1] for an example study, unfortunately not open source). As someone who writes numerical simulation code for a living, I've started to get curious about OpenCL for the use case of rapidly developing code that's reasonably performance portable between Xeon-based and GPU workstations.
Since you seem like you have some experience with testing performance portability with OpenCL and other solutions, I'd be curious to hear if you have any comments about the reference I linked or more general suggestions for alternative means to achieving the same end (performance portability between CPU/GPU architectures at the workstation level).
Hi,
I would say it depends on the kind of code you are working with. Our code is quite branchy and somewhat complicated (3D rendering software).
For simpler code that does, e.g., lots of the same thing on a regular grid, OpenCL might work better (targeting CPUs) than it does on our code.
I wouldn't know how to utilize this yet, but suppose I write a C program that uses SIMD similar to the way you can in the Dart programming language with its Float32x4 and Int32x4 types [1]. Could that same program then be expanded with SPMD to use all 4 cores, doing 4 (SIMD) vectors x 4 (SPMD) parallel tasks on my quad-core 2013 i7-4700MQ, by using the Intel compiler and specially written C code, for a maximum 1600% speedup of 4x4 matrix operations? I am guessing it would be more like 800% to 1200% in reality if lucky, but still promising.
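For what it's worth, the Float32x4 idea maps fairly directly onto GCC/Clang vector extensions in plain C. A minimal sketch of the concept (this is my own illustration, not Dart's actual API, and the type name `float32x4` is mine):

```c
/* A Dart Float32x4-style type in C via GCC/Clang vector extensions:
   four floats packed into one SIMD register, with elementwise
   arithmetic compiling to single vector instructions. */
typedef float float32x4 __attribute__((vector_size(16)));

/* Elementwise multiply-add across all four lanes at once; with
   -mfma the compiler can emit a single FMA instruction. */
float32x4 madd(float32x4 a, float32x4 b, float32x4 c) {
    return a * b + c;
}
```

Running four such SIMD streams on four threads is exactly the 4 x 4 layering the question describes; whether it nets the full 16x depends on memory bandwidth and how well the work splits.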
So this compiler is targeting the SIMD units of CPUs rather than GPUs. Can anyone contrast what the performance of this would be relative to CUDA or OpenCL for various applications, for example neural nets?
For large data and trivial algorithms, such as multiplying matrices (and thus any problem you can express as a set of operations on large matrices), GPUs do really well; it's hard to compete with something that has 1000 cores (Edit: "compute units", not "cores"). Neural nets are essentially matrix multiplication.
However, a lot of interesting problems are seemingly parallel but highly branching and nonlinear. Take path tracing as an example: it's very little code and highly parallel, as each ray/pixel is independent, yet it's not an easy problem for a GPU: each time a ray bounces it will disperse and stop doing whatever the ray next to it was doing in terms of which geometry it will hit, etc.
It might seem like today, if a problem can benefit from 8 CPU cores, then it benefits 100x more from being run on a GPU, but this is far from true. A great machine for general computing could do well with a board with 100 x86 CPUs apart from a big GPU with a thousand cores for brute-forcing the "simpler" problems.
> A great machine for general computing could do well with a board with 100 x86 CPUs apart from a big GPU with a thousand cores for brute-forcing the "simpler" problems.
Which is the idea behind Intel's Xeon Phi "GPU" with 70+ Pentium/Atom cores, which this compiler specifically targets.
[Long before that, Intel showcased an 80 core x86 CPU in 2007 (Polaris/Teraflops Research Chip) – and then promptly shelved it to focus on building programming languages and compilers that can actually make use of it, before introducing the Xeon Phi half a decade later.]
The problem with having 100 x86 CPUs on one board is that NUMA becomes a serious problem. When memory accesses from one CPU to different regions of memory have quite different bandwidths and latencies, you're much better off acknowledging that up front, designing a proper interconnect (InfiniBand), and not sharing memory between threads but rather communicating explicitly (MPI).
> ... so it's hard to compete with something that has 1000 cores.
Any references for what has "1000 cores"? Nvidia GPUs usually have about 12 or so cores that can be compared to x86 cores, meaning they can branch independently.
For example, the high-end Nvidia GTX 980 GPU has only 16 such comparable SIMD execution cores, SMXs or whatever Nvidia calls them.
GPU marketing materials confusingly call "cores" something closer to x86 CPU SIMD lanes (and that's being very generous to GPUs), which artificially inflates the numbers.
Or put differently: one CUDA core can compute up to 1 FMA per cycle at 1196-1300 (?) MHz. One recent Intel x86 core can compute at least up to 16 FMAs per cycle at 2800-4000 MHz.
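A back-of-the-envelope version of that comparison, with one FMA counted as two FLOPs as usual. The clock values are rough midpoints of the quoted ranges, not measurements:

```c
/* Rough per-core peak FLOP/s from the per-cycle FMA throughput and
   clock figures quoted above. One FMA = two FLOPs by convention. */
static double core_peak_flops(double fmas_per_clock, double hz) {
    return fmas_per_clock * 2.0 * hz;
}

/* CUDA core: 1 FMA/clk @ ~1.2 GHz -> ~2.4 GFLOPS
   x86 core: 16 FMA/clk @ ~3.0 GHz -> ~96 GFLOPS, i.e. roughly 40x
   per "core" in favor of the CPU, which is why the marketing counts
   are so misleading. */
```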
Sorry, should have written "compute units" - the point is as you point out that flops is one thing but branch prediction and feeding those floating point units is another.
There has been a surge in the tractability of massive but simple linear algebra problems lately, such as deep learning, which might have given the impression that GPUs are the answer to any supercomputing.
I am not very familiar with Nvidia hardware, but I imagine an SMX is not the smallest unit that can branch. A "warp" can branch independently and is 32 lanes wide, so I figure an SMX with 192 "CUDA cores" can run 6 warps. That's still hundreds of cores rather than thousands, but much more than a dozen.
SMXs were the basic silicon unit tiled in Nvidia's Kepler generation; SMMs are basically the same thing for Maxwell.
A "warp" is analogous to a hardware thread, and you'd have up to 64 of those being scheduled on each SMX or SMM. Each SMX/SMM has four warp schedulers which issue instructions to execution units. In an SMX the schedulers can issue to any of the 192 execution lanes, but in an SMM each scheduler has its own set of execution lanes. If we call a core anything that can independently issue instructions, then I guess you'd call an SMX a core, but on an SMM each warp scheduler looks like its own core. This is all further complicated by the fact that an instruction issued to one lane can be crossed over to a lane that's become idle due to predication, which is maybe sort of like scheduling, but not really.
But yes, you can't compare "CUDA cores" to actual cores, and GPUs aren't equivalent to thousands of cores. The GM204 would have 64 core equivalents, and most other chips would have fewer.
I think a warp is more like a hardware thread, and one SMX processes one particular warp per clock cycle. So on any given clock cycle you still have just as many independent simultaneous control paths as you have SMX units.
Not quite. All warps run in parallel (otherwise you wouldn't get the performance numbers) and each has its own control path (actually its own code), but, indeed, only one can execute control-flow instructions at a time, since the control unit is shared within the SMX.
Well, GPUs don't have any branch prediction or out-of-order capabilities, so you need some way to keep the execution units (mainly floating-point units) busy.
A warp is really nothing more than a way to have work available for the SMXs (and the computational units they control) on as many clock cycles as possible. You need some way of masking FPU pipeline and memory latency.
> All warps are running in parallel (otherwise you won't get the performance numbers) and each has its own control path (actually each has its own code)
It's not that different from x86 hyperthreading, just with more hardware threads. Pipelined execution units are fed each clock cycle by the core. Multiple FP operations are in flight in parallel; otherwise CPUs wouldn't get their performance numbers either.
Sure, an SMX can also switch between warps in a manner similar to hyperthreading on x86, but that does not mean it executes a single warp at a time. Consider the Tesla K40, a GK110 with 15 SMXs. It runs at 750 MHz and has a peak performance of 4.29 TFLOPS. If each SMX could only execute one warp at a time, it could get at most 15 (number of SMXs) x 32 (warp width) x 750M (frequency) x 2 (two flops per FMA) = 720 GFLOPS.
The Tesla K40 has a peak double-precision performance of ~1.4 TFLOPS. Each SMX has 64 DP cores, and the warp scheduler can schedule four warps per SMX per cycle, so it can have two warps executing double-precision instructions at the same time. But that number is not very interesting; the memory bandwidth, on the other hand, is. A GK110 has 288 GB/s: take your code, get its arithmetic intensity, and you have an upper bound for your performance, assuming you are memory bound of course.
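That upper bound is just the roofline model, which can be sketched in a few lines. The peak and bandwidth numbers are the K40 figures quoted above; the arithmetic-intensity values in the comments are hypothetical examples:

```c
/* Roofline-style performance bound: attainable FLOP/s is capped by
   min(peak compute, arithmetic intensity x memory bandwidth).
   arithmetic intensity = flops performed per byte moved from memory. */
static double roofline_bound(double peak_flops, double bw_bytes_per_s,
                             double flops_per_byte) {
    double mem_bound = flops_per_byte * bw_bytes_per_s;
    return mem_bound < peak_flops ? mem_bound : peak_flops;
}

/* For the K40 DP figures (1.4 TFLOPS, 288 GB/s): a kernel doing
   1 flop/byte is memory bound at 288 GFLOPS; only above roughly
   4.9 flops/byte does the compute peak become the limit. */
```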
It's true that problems with diverging control flow work better on a number of cores than on a GPU. But by the same token a GPU's SIMT execution model does better with ray tracing than the SIMD units on a CPU. And the article is about targeting SIMD units.
"Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU" is a good paper on this; obviously it has an agenda, but it rings true in my experience.
That's a good reference. However, the drawback of comparing a GTX 280 vs. an i7 from the same period is that now, 8 years later, GPUs have scaled quite well for the same set of (simple) problems, with more units and bandwidth, whereas CPU performance hasn't. The difference between today's biggest graphics cards and the GTX 280 is larger than the difference between the i7 from 2008 and the big desktop CPUs from 2016. The "100x" is still a myth, but we are significantly closer today than in 2008.
2008 Bloomfield i7 can do 8 FP ops per clock cycle. Recent Intel CPUs can do 32 FP ops per clock.
Bloomfield era you could have 4 (?) cores per CPU socket. Now Broadwell EP has 22.
The only thing that hasn't scaled much on the CPU side is memory bandwidth. I think it's only a matter of time until Intel integrates HBM2 or something like it into the same package. They've already done that with eDRAM.
> Bloomfield era you could have 4 (?) cores per CPU socket.
4 per socket, and at most 2 sockets per board.
Broadwell-EX, to be released this quarter, has 24 cores and up to 8 sockets per board.
So 64 FP ops per machine and cycle versus… 6144.
In the same time, GPUs went from 900 GFLOPS per card, 2 cards per machine (1800 GFLOPS total vs. 192 on CPU), to 9600 GFLOPS per card, 4 cards per machine (38400 vs. 12000). GPUs are still faster, but the advantage isn't that significant any more.
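A quick sanity check of those totals. The clock speeds here are rough assumptions (3 GHz for the 2008 parts, 2 GHz for the many-core 2016 parts) chosen to line up with the quoted numbers:

```c
/* Peak machine FLOP/s from per-clock throughput, core count, socket
   count, and an assumed clock. One FP op here already counts FMA as
   two, matching the "FP ops per clock" figures in the thread. */
static double machine_peak_flops(double ops_per_clock, int cores,
                                 int sockets, double hz) {
    return ops_per_clock * cores * sockets * hz;
}

/* 2008 CPU box:  8 ops/clk x 4 cores x 2 sockets x 3 GHz = 192 GFLOPS
   2016 CPU box: 32 ops/clk x 24 cores x 8 sockets x 2 GHz = ~12.3 TFLOPS
   Against 1800 and 38400 GFLOPS of GPUs, the advantage shrinks from
   roughly 9x to roughly 3x. */
```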
32 FP ops per clock? I don't think that is true, and if it is it is very misleading.
First with Haswell introducing the fused multiply add, suddenly all the 'peak flops' numbers doubled, which is technically true, but only if everything you do is a fused multiply add (with no cache misses of course).
Even so, only the Xeon Phi (and only the unreleased Silvermont cores?) has 16-wide vector units; even Skylake still has 8-wide AVX units, which would be 16 FMA operations.
Are you saying that AVX instructions are pipelined (or some other technique) and have a throughput greater than their width per cycle?
> First with Haswell introducing the fused multiply add, suddenly all the 'peak flops' numbers doubled, which is technically true, but only if everything you do is a fused multiply add (with no cache misses of course).
Yeah, FMA (fused multiply-adds).
For better or worse, it's the de facto standard to quote one FMA as two FLOPs, because it's a very commonly combined operation.
> Are you saying that AVX instructions are pipelined (or some other technique) and have a throughput greater than their width per cycle?
Yeah, AFAIK, they're (mostly) pipelined and dual issue per clock.
One significant benefit of CPU SIMD is that you also do not need to manage temperamental GPUs and GPU drivers. The CPU programming experience is much nicer, the infrastructure more robust, and you can generally expect some sort of SIMD support everywhere. It is not too hard to support SIMD-with-fallback code (and SIMD will generally just work, if supported). GPU support will often require configuration on the part of the user, especially if the system has several GPUs.
(I say this as a GPU language developer. They are fast, but also a bit of a pain in the ass.)
I think that because Vulkan won't be nearly as niche, GPU driver support and consistency will need to be much better than they have been for OpenCL. Where that leaves the flexibility of the compute side, I can't say.
SPMD is conceptually more like a job system, so this generates code that is not necessarily all executing the same code in lockstep. SPMD vs. SIMD is "single program" vs. "single instruction". The compiler does also exploit SIMD, but that's orthogonal to the SPMD concept.
That would depend heavily on the hardware and how the program was written. It is an apples and oranges comparison.
I will say this, though: a program written well in ISPC, with cache locality taken into account together with SIMD, can run 100x faster than a naive C program.
Potentially significant, since they add new types and control structures. The new types tell the optimizer whether data is local to SIMD computations and let it optimize memory layout for them, and the control structures allow for converged control flow at runtime where only one branch is taken by all threads.
See the full paper I link to as a top-level comment for more details.
With intrinsics you have to write each instruction by hand. Not only is this a lot of work and compiler-specific, but if you aren't familiar with all of the instructions at your disposal, it's unlikely you'll get the same performance. Beyond that, ISPC can compile to multiple SIMD lane widths, so code doesn't need to be rewritten when the width increases or decreases.
One example is the n-body simulation from the Computer Language Benchmarks Game. The C++ version uses intrinsics but wouldn't benefit from anything that can do 4 doubles at a time instead of only two.
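To illustrate the width-lock problem: an SSE2 intrinsic version bakes the 2-double width into the source, while the equivalent plain loop can be vectorized to whatever width the target supports. A sketch (function names are mine; the SSE2 version assumes n is even):

```c
#include <emmintrin.h>  /* SSE2: the 2-double width is baked in */

/* Width-locked: explicitly processes 2 doubles per step. Moving to
   4-wide AVX means rewriting this with __m256d and _mm256_* calls. */
void scale_sse2(double *x, double s, int n) {
    __m128d vs = _mm_set1_pd(s);
    for (int i = 0; i < n; i += 2)
        _mm_storeu_pd(x + i, _mm_mul_pd(_mm_loadu_pd(x + i), vs));
}

/* Width-agnostic: the compiler (or an ISPC-style tool) can emit 2-,
   4-, or 8-wide code from this same source as targets change. */
void scale_portable(double *x, double s, int n) {
    for (int i = 0; i < n; i++)
        x[i] *= s;
}
```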
I would argue that learning the 'basics' of intrinsic is less of a burden than learning a new C extension and modifying your existing code base to include additional new build tools.
OTOH, I like the idea of automatic lane width detection.
I've learned ISPC and didn't find it too difficult. It largely boils down to the varying and uniform keywords along with one more loop syntax. I've looked into intrinsics and I'm extremely skeptical that there's much benefit there, other than playing with the actual CPU instructions. ISPC produces tiny .o/.obj files and a header file; integration has not been a hurdle that I can remember.
Based on my understanding of both, learning the new C extensions would be easier than using the intrinsics. Whatever concepts you need to understand the new C extensions (which are limited to type modifiers and a few new control-flow modifiers) you will also need to understand to use the intrinsics.
However, I agree that having another dependency in your build may be more of a problem.
How many cores and SIMD lanes? What type of ANN did you use it for? Was it hard to integrate into your existing code? This is a big deal for me, since I gave up C for Erlang/LFE, and I am working my way through Gene Sher's book 'Handbook of Neuroevolution Through Erlang'. I like the conceptual fit, but at present I'd be more interested in trying this on some old C-coded NNs.
I caught that too, and I was surprised, but like a lot of closed research, there it was. I'm guessing it will help reinvigorate some uses for older Intel chips, and certainly newer ones. All good for Intel, and for those who need the speedup.