I am not very familiar with NVidia hardware, but I imagine an SMX is not the smallest unit that can branch. A "warp" can branch independently and is 32 lanes wide, so I figure an SMX with 192 "CUDA cores" can run 6 warps. That's still hundreds of cores rather than thousands, but much more than a dozen.
SMXs were the basic silicon units tiled across NVidia's Kepler generation, and SMMs are basically the same thing for Maxwell.
A "warp" is analogous to a hardware thread, and you'd have up to 64 of those being scheduled on each SMX or SMM. Each SMX/SMM has four warp schedulers which issue instructions to execution units. In an SMX the schedulers can issue to any of the 192 execution lanes, but in an SMM each scheduler has its own set of execution lanes. If we call a core anything that can independently issue instructions, then I guess you'd call an SMX a core, but on an SMM each warp scheduler looks like its own core. This is all further complicated by the fact that an instruction issued to one lane can be crossed over to a lane that's become idle due to predication, which is maybe sort of like scheduling, but not really.
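To make the predication point concrete, here's a toy Python model of how a warp handles a divergent branch: both sides execute serially, and a per-lane active mask decides which lanes commit results. The names and structure are purely illustrative, not NVidia's actual hardware mechanism.

```python
# Toy SIMT model: one 32-lane warp runs BOTH sides of a branch, with a
# per-lane "active mask" deciding which lanes commit their results.
# Illustrative sketch only -- real hardware uses predicate registers and
# a reconvergence stack, not Python lists.

WARP_WIDTH = 32

def simt_branch(values, cond, then_op, else_op):
    """Execute a divergent branch the way a warp does: serially, under masks."""
    mask = [cond(v) for v in values]          # predicate computed per lane
    out = list(values)
    # "then" side: only lanes whose mask bit is set commit
    for lane in range(WARP_WIDTH):
        if mask[lane]:
            out[lane] = then_op(values[lane])
    # "else" side: the remaining lanes commit; the warp pays for both passes
    for lane in range(WARP_WIDTH):
        if not mask[lane]:
            out[lane] = else_op(values[lane])
    return out

result = simt_branch(list(range(32)),
                     lambda v: v % 2 == 0,   # branch condition per lane
                     lambda v: v * 10,       # "then" path
                     lambda v: -v)           # "else" path
```

The point of the sketch: divergence *within* a warp costs a pass over each branch side, while different warps are free to take different paths entirely.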
But yes, you can't compare "CUDA cores" to actual cores, and GPUs aren't equivalent to thousands of cores. The GM204 would have 64 core equivalents, and most other chips would have fewer.
I think a warp is more like a hardware thread, and one SMX processes one particular warp per clock cycle. So on any given clock cycle you still have only as many independent simultaneous control paths as you have SMX units.
Not quite. All warps are running in parallel (otherwise you won't get the performance numbers) and each has its own control path (actually each has its own code) but, indeed, only one can execute control flow instructions at a time since the control unit is shared in the SMX.
Well, GPUs don't have any branch prediction or out-of-order capabilities, so you need a way to keep the execution units (mainly the floating-point units) busy.
A warp is really nothing more than a way to have work ready for the SMXs (and the computational units they control) on as many clock cycles as possible. You need some way of hiding FPU pipeline and memory latency.
> All warps are running in parallel (otherwise you won't get the performance numbers) and each has its own control path (actually each has its own code)
It's not that different from x86 hyperthreading, just with more hardware threads. Pipelined execution units are fed each clock cycle by the core; multiple FP operations are in flight in parallel, otherwise CPUs wouldn't get their performance numbers either.
Sure, an SMX can also switch between warps in a manner similar to hyperthreading on x86, but that does not mean it executes only a single warp at a time. Consider the Tesla K40, a GK110 with 15 SMXs. It runs at 750 MHz and has a peak performance of 4.29 Tflops. If each SMX could only execute one warp at a time, it could get at most 15 (number of SMXs) x 32 (warp width) x 750M (frequency) x 2 (two flops per FMA) = 720 Gflops.
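A quick sanity check of that arithmetic, using the numbers from the comment (750 MHz and 192 SP lanes per SMX are taken from this thread, not re-verified against the spec sheet):

```python
# If each of the K40's 15 SMXs could issue only one 32-lane warp per clock,
# peak throughput would be far below the quoted 4.29 Tflops single precision,
# so several warps must be issuing per SMX per cycle (192 SP lanes = 6 warps).
num_smx   = 15
warp      = 32
lanes     = 192        # SP execution lanes per SMX (Kepler)
clock_hz  = 750e6      # frequency used in the comment above
flops_fma = 2          # one fused multiply-add counts as two flops

one_warp_peak  = num_smx * warp  * clock_hz * flops_fma   # 720 Gflops
all_lanes_peak = num_smx * lanes * clock_hz * flops_fma   # ~4.32 Tflops
```

The second figure lines up with the quoted 4.29 Tflops (the small gap comes from the rounded 750 MHz clock), which is the whole argument: the hardware must be executing roughly six warps' worth of lanes per SMX per cycle, not one.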
The Tesla K40 has a peak double-precision performance of ~1.4 Tflops. Each SMX has 64 DP cores, and the warp schedulers can schedule four warps per SMX per cycle, so it can have two warps executing double-precision instructions at the same time. But that number is not very interesting; the memory bandwidth, on the other hand, is. A GK110 has 288 GB/s: take your code, get its arithmetic intensity, and you have an upper bound for your performance, assuming you are memory bound of course.
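That upper bound is just the roofline model. A minimal sketch with the K40 numbers from this thread (the daxpy example and its byte counts are my own illustration, not from the comment):

```python
# Roofline bound: attainable performance is capped by either the peak flop
# rate or by (arithmetic intensity * memory bandwidth), whichever is lower.
peak_dp_flops = 1.4e12     # ~1.4 Tflops double precision (K40)
bandwidth     = 288e9      # 288 GB/s (GK110)

def attainable(ai):
    """Upper bound on flops/s for a kernel with AI flops per byte moved."""
    return min(peak_dp_flops, ai * bandwidth)

# Crossover intensity: below this, a kernel is memory bound.
crossover = peak_dp_flops / bandwidth       # ~4.9 flops/byte

# Example: daxpy (y = a*x + y) does 2 flops per 24 bytes of DP traffic
# (read x, read y, write y), so AI ~= 0.083 -- deeply memory bound.
daxpy_bound = attainable(2 / 24)            # 24 Gflops, ~2% of peak
```

So for low-intensity kernels like daxpy, the 288 GB/s figure, not the 1.4 Tflops one, is the number that predicts your runtime.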