yes, so the "normal" way that GPUs work is: the architecture and the ISA are so staggeringly optimised they're completely incompatible and incapable of running standard (general-purpose) workloads. no MMU, vast wide SIMD engines, massive numbers of parallel memory interfaces that run really slowly but can handle (when added up) vast bandwidth far in excess of "normal" processor memory, and so on.
on top of that, because it's an entirely separate processor, to get it to do anything you actually have to have a Remote Procedure Call system, operating over Shared Memory!
oink.
so the process for running a GPU shader binary is as follows:
step 1: fire up a compiler (in userspace)
step 2: compiler takes the shader IR and turns it into GPU assembler
step 3: the userspace program (game, blender, whatever) triggers the linux kernel (or windows kernel) to upload that GPU binary to the GPU
step 4: the kernel copies that GPU binary over the Shared Memory Bus (usually PCIe)
step 5: now we unwind back to userspace (with a context-switch) and want to actually run something (OpenGL call)
step 6: the OpenGL call (or Vulkan) gets some function call parameters and some data
step 7: the userspace library (MESA) "packs" (marshals) those function call parameters into serialised data (a rough sketch of this packing is below)
step 8: the userspace library triggers the linux (windows) kernel to "upload" the serialised function call parameters - again over the Shared Memory Bus
step 9: the kernel waits for that to happen
step 10: the userspace proceeds (after a context-switch) and waits for notification that the function call has completed...
... i'm not going to bother filling in the rest of the details; you get the general idea that this is completely insane, and it goes a long way towards explaining why GPU Cards are so expensive and why it takes YEARS to reverse-engineer GPU drivers.
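to make steps 7 and 8 a bit more concrete, here's a minimal sketch (python, purely for illustration) of what that marshalling amounts to. the command opcode, field layout and handles are all made up - the real MESA drivers and kernel ABIs each have their own, far more involved, formats:

    import struct

    # hypothetical draw-call packing: none of these names or layouts are
    # the real MESA / kernel ABI, they just illustrate "serialise the call
    # parameters into a shared-memory command buffer, then kick the kernel"
    CMD_DRAW_ARRAYS = 0x0001        # made-up command opcode

    def pack_draw_call(shader_handle, first_vertex, vertex_count):
        # marshal the function-call parameters into raw bytes
        return struct.pack("<IIII", CMD_DRAW_ARRAYS,
                           shader_handle, first_vertex, vertex_count)

    cmd_buffer = bytearray()
    cmd_buffer += pack_draw_call(shader_handle=42, first_vertex=0,
                                 vertex_count=300)

    # the userspace library now hands cmd_buffer to the kernel (an ioctl on
    # the GPU device node), context-switches, waits on a completion fence,
    # context-switches back... and does all of that for every batch of calls.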
in the Libre-SOC architecture - which is termed a "Hybrid" one - the following happens:
step 1: the compiler is fired up (in userspace, just like above)
step 2: compiler takes the shader IR and turns it into *NATIVE* (Power ISA with Cray-style Vectors and some custom opcodes) assembler
step 3: userspace program JIT EXECUTES THAT BINARY NATIVELY RIGHT THERE RIGHT THEN
done.
did you see any kernel context-switches in that simple 3-step process? that's because there aren't any needed.
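for contrast, here's roughly what step 3 means in practice: the compiled shader is ordinary native machine code in the process's own address space, so "running" it is just an executable mapping and a function call. compile_shader_to_native() is a hypothetical stand-in for the compiler's output, not a real API:

    import ctypes
    import mmap

    def run_shader_natively(native_code: bytes, arg_addr: int) -> None:
        # map a page the process itself can execute - no upload, no RPC,
        # no separate device sitting on the far end of a PCIe bus
        buf = mmap.mmap(-1, len(native_code),
                        prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
        buf.write(native_code)

        # treat the mapping as a C function and call it right there, right then
        addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
        shader_fn = ctypes.CFUNCTYPE(None, ctypes.c_void_p)(addr)
        shader_fn(arg_addr)

    # native_code = compile_shader_to_native(shader_ir)   # hypothetical step 2
    # run_shader_natively(native_code, vertex_buffer_addr)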
now, the thing is - answering your question a bit more - that "just having vector capabilities" is nowhere near enough. the lesson has been learned from Nyuzi, Larrabee, and others: if you simply create a high-performance general-purpose Vector ISA, you have successfully created something that absolutely sucks at GPU workloads: about TWENTY FIVE PERCENT (one quarter) of the capability of a modern GPU for the same power consumption.
therefore, you need to add SIN, COS, ATAN2, LOG2, and other opcodes, but you need to add them with "reduced accuracy" (like, only 12 bits or so) because that's all that's needed for 3D.
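as a rough illustration of what "reduced accuracy" buys you (a software sketch only, not the actual Libre-SOC hardware algorithm): a short odd polynomial already gets to roughly 12 bits of absolute accuracy on [-pi/2, pi/2], which is fine for shading and vastly cheaper than a libm-quality IEEE754 implementation:

    import math

    def sin_approx(x: float) -> float:
        # degree-7 odd polynomial (truncated Taylor series) for sin(x):
        # worst-case error on [-pi/2, pi/2] is around 2**-12, i.e. ~12 bits;
        # a hardware unit would do the same after range-reduction
        x2 = x * x
        return x * (1.0 - x2 * (1.0/6.0 - x2 * (1.0/120.0 - x2 / 5040.0)))

    ticks = [i * (math.pi / 2) / 1000 for i in range(-1000, 1001)]
    err = max(abs(sin_approx(t) - math.sin(t)) for t in ticks)
    print(f"max abs error on [-pi/2, pi/2]: {err:.2e}")   # roughly 1.6e-04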
you need to add Texture caches, and Texture interpolation opcodes (take the 4 pixels at the corners of a unit square - coordinates 00, 01, 10, 11 - plus two FP XY numbers between 0.0 and 1.0, and interpolate between those pixels in 2D).
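in software that texture-interpolation opcode is just standard bilinear filtering, something like this per colour channel (the hardware version operates on whole pixels and feeds from the texture cache):

    def bilinear(p00: float, p10: float, p01: float, p11: float,
                 x: float, y: float) -> float:
        # pXY is the corner pixel at square coordinate (X, Y);
        # x and y are the fractional FP coordinates, each in [0.0, 1.0]
        top    = p00 * (1.0 - x) + p10 * x    # blend along x at y = 0
        bottom = p01 * (1.0 - x) + p11 * x    # blend along x at y = 1
        return top * (1.0 - y) + bottom * y   # then blend along y

    # dead-centre of the square is just the average of the four corners
    print(bilinear(0.0, 0.0, 255.0, 255.0, x=0.5, y=0.5))   # 127.5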
you need to add YUV2RGB and other pixel-format-conversion opcodes that are in the Vulkan Specification...
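YUV2RGB itself is just a small per-pixel matrix multiply; a software version of the common full-range BT.601-style conversion looks roughly like this (the Vulkan spec also covers BT.709, limited-range and other variants):

    def yuv_to_rgb(y: int, u: int, v: int) -> tuple:
        # full-range BT.601-style conversion, 8-bit, chroma centred on 128
        d = u - 128
        e = v - 128
        r = y + 1.402 * e
        g = y - 0.344136 * d - 0.714136 * e
        b = y + 1.772 * d
        clamp = lambda t: max(0, min(255, int(round(t))))
        return clamp(r), clamp(g), clamp(b)

    print(yuv_to_rgb(128, 128, 128))   # mid-grey in, mid-grey out: (128, 128, 128)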
and many more.
but, we first had to actually, like, y'know, have a core that can actually execute instructions at all? :) and that's what this first Test ASIC is: a first step.
Awesome job. I tried to make a simple GPU in chisel w/ hardfloat. I also came to the conclusion that Larrabee was a joke and dedicated triangle interpolation hardware was necessary, but I didn't consider the half-float(?) or caches or other additions you had to make.
or more to the point, one that is compile-time configurable with one parameter (bit-width), so the same HDL does FP16, FP32 and FP64. i'd like to make that dynamically-SIMD-configurable but it'll take some base work in nmigen to do without massive code-explosions.
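the pattern is the usual nmigen one: the bit-width is just a python constructor parameter, so elaborating the same class with 16, 32 or 64 gives three different-width units from one source file. the stub below shows only the parameterisation - an integer add standing in for the FP datapath - and is not taken from the actual Libre-SOC code:

    from nmigen import Elaboratable, Module, Signal

    class DatapathStub(Elaboratable):
        def __init__(self, width: int):
            # one compile-time parameter: 16, 32 or 64
            self.a = Signal(width)
            self.b = Signal(width)
            self.o = Signal(width)

        def elaborate(self, platform):
            m = Module()
            # a real FP16/FP32/FP64 pipeline would go here; the point is that
            # the same HDL source elaborates at any of the three widths
            m.d.comb += self.o.eq(self.a + self.b)
            return m

    fp16_unit = DatapathStub(16)
    fp32_unit = DatapathStub(32)
    fp64_unit = DatapathStub(64)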