More

lkcl · on July 10, 2021

yes. it is however a test ASIC. therefore it has no on-board boot ROM, and has to have programs uploaded to it over JTAG.

vfclists · on July 12, 2021

Is it 32 bit or 64 bit?

lkcl · on July 10, 2021

we use nmigen (python-based OO HDL) which through yosys generates verilog as an automatic step.

180nm is still by far and above the world's most heavily-used geometry, because the price-performance (bang per buck, however you want to put it) is so extremely high.

an 8in wafer is USD 600 and that's extremely low. any power MOSFET, power transistor, diode or other high current semiconductor you absolutely don't want small "things" (detailed tiny tracks) you want MASSIVE ones.

why on earth would you waste money on tiny features, it's like using the latest 0.15mm 3D printing nozzles to 3D print a massive 300x300x300 mm cube that's going to be used for nothing more than a foot-stool. you want a 1.2mm nozzle for that!

then any processor below 300 mhz, you can get away with 180nm. need only an 8 mhz 8-bit or 4-bit washing machine or microwave processor, or something to go in a cheap digital watch? 180nm is your best bet: you'll get tens of thousands of < 1 mm^2 ASICs on a single wafer which means you're well below $0.05 per individual die.

a 28nm 8in wafer would be about... 10x that cost, you'd end up with exactly the same transistor (or 8 mhz 8-bit processor), why would you pay more money for what you don't need?

btw the real reason why there's a chip shortage: the Automotive industry, who are cheap bar-stewards, wanted even lower than $600 per 8in wafer so they went with 360nm and cruder geometry. that's equipment that's even older than the 1990s, like 40+ years in some cases.

so then the stupidity hit, and they stopped ordering. then 18 months later they phone up these old Foundries and say, "ok, we're ready to start ordering again". and the Foundries say, "oh, we switched off the equipment, and it cooled down and got damaged (just like that massive Electric plant in S. Australia that was de-commissioned, the concrete cracked when they switched it off, and it's completely unsafe to start up again). you were our only customer for the past 30 years, so we scrapped it all. you'll have to now compete with the consumer-grade smaller geometry Fabs like everyone else".

which is something that none of the Automotive companies have told their Governments, because then they can't go crying "boo hoo hoo, we can't make chips any more at the price that we demand, waaa, waaaa, i wannnt myyy monneeeeey"

and now of course they can't use the old masks, because those were designed for 360nm and cruder geometries, they have to redesign the entire ASIC for 180nm and that's why you can't now get onto 180nm and other MPW Programmes because the frickin Automotive Industry has jammed them all to hell.

lkcl · on July 9, 2021

you can get a pretty good idea right now, the simulator is functional and the unit tests include explanations in english:

https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/...

i'm currently in the middle of a rabbit-hole exploration of being able to do in-place RADIX-2 FFT, DCT and DFT butterflys, the target is a general purpose function to cover each of those, in around 25 Vector instructions.

not 2,000 optimised loop-unrolled instructions specifically crafted for RADIX-8, another for RADIX-16, another for RADIX-32 ..... RADIX-4096 (as is the case in ffmpeg): 25 instructions FOR ANY 2^N FFT.

btw if you're interested in "real-world" SVP64 Vector Assembler we have the beginnings of an ffmpeg MP3 CODEC inner loop:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=medi...

that's under 100 instructions, more than 4x less assembler for the same job in PPC64. and 6.5 times less assembler than ffmpeg's optimised x86 apply_window_float.S

you will no doubt be aware of the huge power savings that brings due to reduced L1 cache usage.

lkcl · on July 9, 2021

yes, so the "normal" way that GPUs work is: the architecture and the ISA are so staggeringly optimised they're completely incompatible and incapable of running standard (general-purpose) workloads. no MMU, vast wide SIMD engines, massive numbers of parallel memory interfaces that run really slowly but can handle (when added up) vast bandwidth far in excess of "normal" processor memory, and so on.

on top of that, because it's an entirely separate processor, to get it to do anything you actually have to have a Remote Procedure Call system, operating over Shared Memory!

oink.

so the process for running a GPU shader binary is as follows:

step 1: fire up a compiler (in userspace) step 2: compiler takes the shader IR and turns it into GPU assembler step 3: the userspace program (game, blender, whatever) triggers the linux kernel (or windows kernel) to upload that GPU binary to the GPU step 4: the kernel copies that GPU binary over Shared Memory Bus (usually PCIe) step 5: now we unwind back to userspace (with a context-switch) and want to actually run something (OpenGL call) step 6: the OpenGL call (or Vulkan) gets some function call parameters and some data step 7: the userspace library (MESA) "packs" (marshalls) those function call parameters into serialised data step 8: the userspace library triggers the linux (windows) kernel to "upload" the serialised function call parameters - again over Shared Memory Bus step 9: the kernel waits for that to happen step 10: the userspace proceeds (after a context-switch) and waits for notification that the function call has completed...

... i'm not going to bother filling in the rest of the details, you get the general idea that this is completely insane and goes a long way towards explaining why GPU Cards are so expensive and why it takes YEARS to reverse-engineer GPU drivers.

in the Libre-SOC architecture - which is termed a "Hybrid" one, the following happens:

step 1: the compiler is fired up (in userspace, just like above) step 2: compiler takes the shader IR and turns it into *NATIVE* (Power ISA with Cray-style Vectors and some custom opcodes) assembler step 3: userspace program JIT EXECUTES THAT BINARY NATIVELY RIGHT THERE RIGHT THEN

done.

did you see any kernel context-switches in that simple 3-step process? that's because there aren't any needed.

now, the thing is - answering your question a bit more - that "just having vector capabilities" is nowhere near enough. the lesson has been learned from Nyuzi, Larrabee, and others: if you simply create a high-performance general-purpoes Vector ISA, you have successfully created something that absolutely sucks at GPU workloads: about TWENTY FIVE PERCENT (one quarter) of the capability of a modern GPU for the same power consumption.

therefore, you need to add SIN, COS, ATAN2, LOG2, and other opcodes, but you need to add them with "reduced accuracy" (like, only 12 bit or so) because that's all that's needed for 3D.

you need to add Texture caches, and Texture interpolation opcodes (takes 4 pixels @ 00 01 10 11 square coordinates, plus two FP XY numbers between 0.0 and 1.0, and interpolates the pixels in 2D).

you need to add YUV2RGB and other pixel-format-conversion opcodes that are in the Vulkan Specification...

and many more.

but, we first had to actually, like, y'know, have a core that can actually execute instructions at all? :) and that's what this first Test ASIC is: a first step.

phendrenad2 · on July 9, 2021

Awesome job. I tried to make a simple GPU in chisel w/ hardfloat. I also came to the conclusion that Larrabee was a joke and dedicated triangle interpolation hardware was necessary, but I didn't consider the half-float(?) or caches or other additions you had to make.

lkcl · on July 9, 2021

thx phndrenad2. funny i just searched "chisel gpu" and found two: https://github.com/jbush001/ChiselGPU https://github.com/Chlorophytus/broccoli

half-float we'd like to do by using a dynamic SIMD-aware 64-bit ALU that has auto-partitioning. we do however already have an actual FP16 implementation https://git.libre-soc.org/?p=ieee754fpu.git;a=tree;f=src/iee...

or more to the point, one that is compile-time configureable with one parameter (bit-width), so the same HDL does FP16, FP32 and FP64. i'd like to make that dynmaically-SIMD-configureable but it'll take some base work in nmigen to do without massive code-explosions.

lkcl · on July 9, 2021

we'll be going as far as is practical and pragmatic with the actual hardware, and still actually meet user-expectations. firmware, bootloader, OS, drivers, BIOS: definitely.

lkcl · on July 8, 2021

you'll be fascinated to know that we picked a python-based (Object-Orientated) HDL - nmigen - for exactly this reason.

we've developed a dynamically SIMD-partitionable-maskable set of "base primitives" for example, so you set a "mask" and it automatically subdivides the 64-bit adder into two halves. but we didn't leave it there, we did shift, multiply, less-than, greater-than - everything.

https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/iee... https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/iee...

can you imagine doing that in VHDL or Verilog? tens of engineers needed, or some sort of macro-auto-generated code (treating VHDL / Verilog as a machine-code compiler target).

the reason for doing this - planning it well in advance - is because we're doing Cray-style Vectors (Draft SVP64) with polymorphic element-width over-rides. yes, really. the "base" operation is 64-bit, but you can over-ride the source and destination operation width.

the reason why we're using our own Cell Library is actually down to transparency. we want customers to be able to compile the GDS-II files themselves, fully automated, no involvement from us, no manual intervention.

ironically, as an aside: Staf's Cells are 30% smaller (by area) than the Foundry equivalents.

lkcl · on July 8, 2021

ultimately what we'd like to see is entirely NDA-free PDKs even for 12nm and below, and you can run the VLSI tools and generate the EXACT GDS-II yourself, then yes, de-cap the processor and do a digital comparison.

before you even get to that stage, you run the Formal Correctness Proofs and unit tests on the HDL, so that YOU have confidence that the HDL which you're about to generate the GDS-II files from is actually correct and does the damn job.

example of a Formal Correctness Proof for the fixed arithmetic Power ISA pipeline:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/alu...

runs with symbiyosys, so you end up running SAT Solvers like yices2 and z3.

basically we absolutely do not want to be the people you come to and say, "can we trust your ASIC?" and like Intel they lie to you and say "of course!", we want to say, "don't bloody well ask us, go run the damn tools yourself! oh, btw, if you want help with that we charge USD 5k per hour"

lkcl · on July 8, 2021

thanks :)

lkcl · on July 8, 2021

no, it's pretty basic, and implicit: it's the (newly-created) "Scalar Fixed-Point Compliancy Subset) - i added a bit to the wikipedia page last month about them https://en.wikipedia.org/wiki/Power_ISA#Compliancy

it's 64-bit, LE/BE, and it's implementing a "Finite State Machine" (similar technique to picorv32, if you know that design). this because we wanted to keep it REALLY basic, and also very clear as a Reference Design, none of the "optimised pipelined decoders and issuers" that you normally find, which make it really, really difficult to see what the hell is going on.

bear in mind this includes SVP64: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple...

if you go back several revisions, the non-Vectorised version is like... 400 lines?

lkcl · on July 8, 2021

we used an entirely Libre-licensed VLSI "compiler", which takes HDL as input and spits out fully-completed GDS-II Files.

the problem with this particular irate individual is that he's assumed that because TSMC's DRC rules are only accessible under NDA that automatically absof*** everything was also "fake open source".

idiot.

sigh.

clearly didn't read the article.

whilst both Staf Verhaegen and LIP6.fr signed the TSMC Foundry NDA, we in the Libre-SOC team did not. we therefore worked entirely in the Libre world, honoured our committment to full transparency, whilst Staf and Jean-Paul and the rest of the team from LIP6 worked extremely hard "in parallel".

the ASIC can therefore be compiled with three different Cell Libraries:

* LIP6.fr's 180nm "nsxlib" - this is a silicon-proven 180nm Cell Library * Staf's FreePDK45 "symbolic" cell library using FlexLib (as the name says, it uses the Academic FreePDK45 DRC) * the NDA'd TSMC 180nm "real" variant of Staf's FlexLib

i was therefore able to "prepare" work for Jean-Paul, via the parallel track, commit it to the PUBLIC REPOSITORY (the one that's open, that our resident idiot didn't bother to check existed or even ask where it is), which saved Jean-Paul time whilst he focussed on fixing issues in coriolis2.

it was a LOT of work.