Agner says that he sees no point in hyperthreading. Then he also complains that most AVX512 operations have a latency of 2 clock cycles.
Those two, I think, go hand in hand: the point is not to soak up unused execution resources with the additional hyperthreads (hence the decode limitation), but to use the other threads to fill the pipeline bubbles created by the higher-latency instructions, à la barrel processor.
edit: this of course works as KNL is aimed at throughput jobs, not anything latency sensitive.
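To make the barrel-processor point concrete, here's a toy issue model (a sketch, not KNL's actual pipeline): with 2-cycle instruction latency and a dependent chain, a single thread leaves every other issue slot empty, and a second thread fills those bubbles.

```python
def utilization(num_threads, latency=2, cycles=1000):
    """Toy model: each thread runs a chain of dependent instructions,
    each with the given latency. The core issues at most one
    instruction per cycle, round-robin over threads whose previous
    result is ready."""
    ready = [0] * num_threads  # cycle at which each thread may issue again
    issued = 0
    nxt = 0
    for cycle in range(cycles):
        for i in range(num_threads):
            cand = (nxt + i) % num_threads
            if ready[cand] <= cycle:      # this thread's result is ready
                ready[cand] = cycle + latency
                issued += 1
                nxt = cand + 1
                break                     # one issue slot per cycle
    return issued / cycles

print(utilization(1))  # 0.5: a bubble every other cycle
print(utilization(2))  # 1.0: the second thread fills the bubbles
```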
This is correct: it is not possible to get peak IPC without using all the hyperthreads, by design. It is not a CPU even though it uses a CPU's ISA, so treating it as a CPU for optimization purposes is a mistake.
The advantage of its microarchitecture is that, if you use the hyperthreads correctly, it will run a broad range of code at close to the theoretical IPC of the silicon without too much code effort (if you understand the model). This is in contrast to CPUs, which rarely get close to their theoretical IPC no matter what you do, or GPUs, which can only run a very narrow range of codes at close to theoretical IPC. In principle it is extremely efficient in terms of operation throughput, and it isn't sensitive to what those operations are, but it requires a large number of independent operations to be in flight to do its magic, hence all the hyperthreads.
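The "large number of independent operations in flight" requirement is just Little's law: concurrency = throughput × latency. A quick back-of-envelope sketch (the 2-wide issue and 2-cycle latency come from the discussion above; the per-thread ILP numbers are made up for illustration):

```python
import math

def threads_needed(issue_width, latency, ilp_per_thread):
    """Little's law: keeping issue_width slots busy with latency-cycle
    operations requires issue_width * latency independent operations
    in flight; divide by what each thread can supply on its own."""
    return math.ceil(issue_width * latency / ilp_per_thread)

# 2-wide issue, 2-cycle latency:
print(threads_needed(2, 2, 1))  # 4 threads if each exposes 1 independent op
print(threads_needed(2, 2, 4))  # 1 thread if the code itself exposes 4
```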
While the architecture explicitly uses latency-hiding to get throughput, it is mostly about latency-hiding at the sub-microsecond level. As a practical matter, it shouldn't affect the perceived latency of most real-world software.
I've no idea in what way KNL isn't a CPU, or how optimizing for it is fundamentally different from running one in an HPC setting. For vectorized floating point code of the sort for which we're particularly interested in these things (e.g. DGEMM), you get peak performance from a single thread/core.
The design is supposed to be "balanced", and it does appear to do a reasonable job of that for the sort of code that uses the bulk of the time on our system. In some cases, Broadwell will do substantially better, of course. I don't have URLs for performance examples to hand.
People mostly seem to be ignoring the potentially important additions in KNL -- the memory system and the built-in interconnect (though I don't know if the latter is available yet). Also the large core count should help by keeping more MPI communication local. I don't know of any relevant results for multi-node jobs, but 64 cores covers a fair number of the HPC jobs I see.
The names Knights Corner (KNC) and Knights Landing (KNL) are easily confused. KNC was designed as a plug-in card as you describe, but the current KNL generation has standalone systems running a standard operating system. While there are differences in where the bottlenecks are (see Agner's report) you can/should/must optimize for these standalone KNL systems as you would for a CPU: http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/...
hyperthreading != multithreading. He is saying, correctly, that the core doesn't have the resources to handle more than one thread per clock cycle (like for example server class CPUs can); but KNL has hyperthreading for the opposite reason: a single thread can't normally feed the core two new instructions every clock cycle, so you need more threads to keep the core busy on those clock cycles it would be otherwise stalled.
Although he dismisses it, he actually mentions the reason: 'perhaps [it is useful] for code that is limited by memory access, branch mispredictions, or long dependency chains': especially the first and the last are going to be common on code running on KNL.
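On the "long dependency chains" point: the classic fix is to break one serial chain into several independent accumulators, which is exactly the kind of restructuring that gives the extra threads (or the out-of-order window) something to overlap. A minimal sketch, with hypothetical function names:

```python
def chained_sum(xs):
    # One long dependency chain: every add waits on the previous one,
    # so a 2-cycle add latency stalls the pipeline every iteration.
    acc = 0.0
    for x in xs:
        acc += x
    return acc

def split_sum(xs, lanes=4):
    # Four independent chains that the hardware can overlap in flight;
    # only the final reduction is serial.
    accs = [0.0] * lanes
    for i, x in enumerate(xs):
        accs[i % lanes] += x
    return sum(accs)

data = [1.0] * 1000
print(chained_sum(data), split_sum(data))  # same result: 1000.0 1000.0
```

(Python won't show the speedup itself, of course; the point is the shape of the transformation.)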
In case anyone with pull at Intel is reading, let me say this:
It strikes me as insane that you are not making it easier for Agner to get early access to your hardware. He's writing fabulous manuals for free that fill the gaps in the official sources. His work helps more programmers get the most out of your hardware than just about anything else available. You should be throwing money at him to keep him doing this, or at least showering him with offers of free pre-release hardware. But instead you are depending on his generosity and leaving it to chance whether he eventually finds a machine to test on.
I'm going to assume it's because Intel doesn't value the community surrounding their products very highly. at least, not the people who build on their work in some way.
a lot of companies tend towards that behavior. a lot of single developers tend towards that behavior. I don't like it because it represents a kind of destructive egotism.
hopefully that's not what it is and it's just Intel not thinking too hard.
Well, it's been a few years since I worked at Intel, but I'm guessing it's just a misalignment of incentives. It's not that some people inside Intel don't realize who their friends are, it's just that those who do realize who their friends are do not get measured on treating their friends well. Nor does their boss, nor their boss's boss. Until their boss's boss does get measured on how well they treat Intel's friends. Then the friends get smothered in a bunch of distracting attention.
You get what you measure for. Intel measures a lot. Intel fails to measure some important things.
The same is probably true of any company with hyper-focused management. Maybe even your startup.
Seems to me Intel is placing their bet on the compiler doing the lion's share of the work here -- and where hand optimization is needed, another bet that having a consistent instruction set across their big Xeon cores and these MIC devices will make things easier for developers?
Actually, the whole point of masking and scatter-gather in AVX512 is to simplify the compiler's job by allowing pretty much any loop to be vectorized relatively trivially.
It doesn't really require any new compiler breakthrough, as it is pretty much what all the GPU compilers (i.e. cuda) have been doing for a while already.
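The trick masking enables is turning a per-element branch into branch-free, predicated vector code. A small simulation of the idea in scalar Python (the comments name the AVX-512-style steps; this is an illustration of the semantics, not real intrinsics):

```python
def masked_add(a, b):
    """Simulates vectorizing `for i: if a[i] > 0: a[i] += b[i]`
    with predication instead of a branch."""
    mask = [x > 0 for x in a]               # vector compare -> mask register
    added = [x + y for x, y in zip(a, b)]   # unconditional vector add
    # masked merge: take the new value where the mask is set,
    # keep the old value elsewhere
    return [n if m else o for m, n, o in zip(mask, added, a)]

print(masked_add([1.0, -2.0, 3.0], [10.0, 10.0, 10.0]))
# [11.0, -2.0, 13.0]
```

Because every lane executes the same instructions and the condition lives in a mask, the compiler never has to prove anything about control flow inside the loop body.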
Not to mention scatter-gather and masking in Cray's vector processors. You do still need dependency understanding in the compiler, but that's pretty easy now compared to the late 1970s.
So, how was that 20 year nap you just woke up from? That's pretty much the way everyone does it now, and has been for some time.
The central lesson of RISC is not "fewer, simpler instructions is good", it is "let the compiler do what it does well, and let the hardware do what it does well." The increase in available silicon has been moving that boundary for many years. Ever since we finally had enough silicon to implement out-of-order execution with synchronous exceptions, for example, register allocation is no longer a problem to be solved in the compiler.
Classic RISC only made sense when available bandwidth to main memory was at rough parity with on-die memory bandwidth. Those days were brief. But the idea of intelligently dividing optimization tasks between the CPU and the compiler is a timeless idea. The optimal implementation, however, is a function of available technology on both sides.
Intel has been betting on the compiler doing low level optimizations for about forever now; their compilers and software development tools are not a small part of their business.
At this point Intel is betting that anyone who's capable of doing large scale low level optimization will be designing their own hardware, including the CPU, so Intel is better off focusing on high performance computing for the masses.
Interoperability with the big x86 cores is also what Intel wants, because it means the software can run on anything. Even GPU-based HPC efforts want x86 compatibility; this is why AMD and Intel dragged NVIDIA to court a couple of years back.
I don't know about that, but Intel are putting some effort into relevant libraries (typically free software, other than MKL, I'm pleased to say). An example is the small matrix multiplication library libxsmm, which is written up for Supercomputing 16 as referenced from the repo on github. ("Simple loops"...)
Having a consistent instruction set with the CPU doesn't help much. Learning a new instruction set doesn't take that long, as more recent instruction sets all share the same ideas and don't really have the quirks they did in the old days.
When you go hand-optimizing, you want to squeeze out the last bit of performance. And that (currently) is very CPU-specific, because all CPUs have different cache/memory systems, instruction latencies, and pipeline structures. When I was doing low-level work (for video processing), a lot of the time I had to strike a balance across various microarchitectures. Some more extreme people (including Intel's ICC) target each CPU model specifically.
Because the MIC has a vastly different architecture, the resulting hand-optimized code will be vastly different. You still need to learn both of them anyway.
It doesn't seem like a particularly bad approach. I would guess that most developers don't have the skills (or time, for that matter) to be able to efficiently write code at the hand optimisation level.
Abstracting this away to the compiler is a safe bet, and developers will soon learn ways to at least generally optimise the code for the compiler, much like people have already learned to optimise code for javac, v8, etc.