Does anyone know of a good wiki for doing multicore RISC-V on FPGA? Something mo...

aseipp · on Dec 23, 2022

A k-LUT SRAM-based FPGA where k=6 is something like 100x more inefficient in terms of raw transistor count than the equivalent ASIC gates when I last did some napkin math (though there's a statistical element in practice when considering k-LUTs vs fixed netlists.) But the SRAM requirements scale 2^k with the LUT size, so the highest you get in practice today is 6-LUTs and 80% of vendors do 4-LUT. And then you need all the transistors for the scan chain to configure each bit, actual block RAM, fixed function DSPs, etc. Clocking them to really high frequencies is also difficult. They're mostly overpriced because the market is small and the only competitors in town can rob you; bulk buys from having costs 1/50th the list-price, per-device, isn't unusual.

But the big problem isn't really the hardware. You can go create and simulate a robust design in a number of HDLs, there are all kinds of tools to do it now, though they aren't as fancy as proprietary ones. It's doable. But it's having a good software stack that matters. You can blow a billion dollars on a chip and still lose because of the software. Nvidia figured this out 10 years ago and everybody else is still playing catch up.

And it takes an insane amounts of software engineering to make a good compute stack. That's why Nvidia is still on top. Luckily there are some efforts to help minimize this e.g. PyTorch-MLIR for ML users, though more general compute stack needs, such as accelerated BLAS or graph analytics libraries or OpenMPI, are all still "on you" to deliver. But if you stick your head in the sand about this very important point, you're going to have a bad time and waste a lot of money.

zozbot234 · on Dec 24, 2022

A 100× transistor count would amount to basically 6 to 7 doublings of Moore's law. Or 10× in nm device lengths. So once the inherent difficulties of designing a chip for trailing-edge ASIC are addressed (with better free EDA tools and such) it seems that FPGA-based commercial products (as opposed to use of FPGAs for bespoke prototyping needs) should become quite uncompetitive. There's also structured ASIC, multi-project wafer, etc. design approaches that are sort of half-way, and might provide an interesting alternative as well. OTOH, FPGA's might also be more easily designed to integrate pre-built components like CPU cores, and the 100× rule wouldn't apply to such parts if used in an FPGA-based design.

imtringued · on Dec 24, 2022

100x transistor count is a big enough difference that you would want some sort of ALU and branching/FSM/loop unit arranged in an array with a few fpga elements on inputs and outputs.

It sounds to me that the real problem is still that the ideal programming model hasn't been created yet.

What would it look like? Compile a C function into an FPGA pipeline? A dataflow language where you explicitly define processes and their implementation?

I mean imagine if you could take a mathematical formula, how would it translate into a series of additions, subtractions and multiplications? You could write it down in Fortran and then you would have a well defined and ordered tree of operations. Do we just translate that tree into hardware? Like, you have 100 instructions with no control flow and you just translate it into 100 ALUs? Does it make sense to reuse the ALUs and therefore have 100 instructions map to less than 100 ALUs?

If we assume the above model, then there are specific requirements for the hardware.

What if need more than one algorithm? Can I switch the implementation fast enough? Can the hardware have multiple programmed algorithms for the same ALUs?

Programmable ALUs sound awfully close to what a CPU does but in theory you would just have a register file with say 16 different ALU configurations and the data you are sending through the ALUs is prefixed with a 4 bit opcode that tells the ALU which configuration to use. We are getting further and further away from what an FPGA does and closer to how CPUs work but we still have the concept of programming the hardware for specific algorithms.

These are just random thoughts but they reveal that the idea of a pure FPGA is clearly not what the accelerator market needs.

imtringued · on Dec 24, 2022

>And it takes an insane amounts of software engineering to make a good compute stack

That isn't actually true. It takes commitment to the platform. My AMD GPU from 2017 had its ROCm support dropped in 2021. Very nice. That sends a signal that ROCm isn't worth supporting by application developers. In other words, AMD decided that there are ROCm and gaming GPUs whereas with Nvidia there is no distinction.

Sirened · on Dec 23, 2022

What sort of wiki are you envisioning here? There is some decent tooling and docs around generating SoCs [1] but, as the article mentions, the most difficult part is not creating a single RISCV core but rather creating a very high performance interconnect. This is still an open and rich research area, so you're best source of information is likely to just be google scholar.

But, for what it's worth, there do seem to be some practical considerations why your idea of a hugely parallel computer would not meaningfully rival the M1 (or any other modern processor). The issue that everyone has struggled with for decades now is that lots of tasks are simply very difficult to parallelize. Hardware people would love to be able to just give software N times more cores and make it go N times faster, but that's not how it works. The most famous enunciation of this is Amdahl's Law [2]. So, for most programs people use today, 1024 tiny slow cores may very well be significantly worse than the eight fast, wide cores you can get on an M1.

[1] https://chipyard.readthedocs.io/en/stable/Chipyard-Basics/in...

[2] https://en.wikipedia.org/wiki/Amdahl's_law

imtringued · on Dec 24, 2022

The problem isn't that algorithms are inherently sequential though but rather that parallel programming is a separate discipline.

In single threaded programming you have almost infinite flexibility, you can acquire infinite resources in any arbitrary order. In multithreaded programming you must limit the number of accessible resources and the order is preferably well defined.

In my opinion expecting people to write parallel algorithms is too much, not because it is too difficult but rather because it has to permeate through your entire codebase. That is a nonstarter unless the required changes are not unreasonable.

The challenge then becomes, how do we let people write single threaded programs that can run on multiple cores and gracefully degrade to being single threaded the worse the code is optimized?

I don't have the perfect answer but I think there is an opportunity for a trinity of techniques that can be used in combination: lock hierarchies, STM and the actor model.

There is a unit of parallelism like the actor model that executes code in a sequentially single threaded fashion. Multiple of these units work in parallel, however, rather than communicating through messages, STM is used to optimistically execute code and obtain a runtime heuristic of the acquired locks. If there are no conflicts, then performance scales linearly, if there are conflicts, then by carefully defining a hierarchy of the resources, you can calculate the optimal level in the hierarchy to execute the STM transaction in. This will allow the STM transaction to succeed with a much higher chance which then eliminates the primary downside of STM: performance loss due to failed transactions whose failure rate creeps up the more resources are being acquired.

A lock hierarchy could look like this: object, person, neighborhood, city, state, country, planet.

You write an STM transaction that looks like single threaded code. It loops around all people in the times square and would thereby acquire their locks. However, if that transaction was executed on the object level, it would almost certainly fail because out of thousands of people only one needs to be changed by another transaction to fail. The STM transaction acquired a thousand people, therefore it's optimal level in the hierarchy is the neighborhood. This means only one lock needs to be acquired to process thousands of people. If it turns out that the algorithm needs information like the postal address of these people, it is possible that some of them are tourists and you therefore acquire resources that are all over the world, you might need to execute this transaction at the highest level of the hierarchy for it to finish.

The primary objection would be the dependence on STM for obtaining the necessary information about which resources have been acquired. This means that ideally, all state is immutable to make software transactional memory as cheap as possible to implement. This is not a killer but it means that if you were to water this approach down, then the acquired resources must be static, i.e. known at compile time to remove the need for optimistic execution. That still works and it lets you get away with a lot more single threaded code than you might think, especially legacy code bases.

musicale · on Dec 24, 2022

Shame that OpenSPARC T1 was a dead end. And Open MIPS. etc..

RISC-V is maybe finally something that will stick because it can't be killed by a stupid company.

rjsw · on Dec 23, 2022

You could do a trial build of an in-order Rocket RISC-V core [1] to see how much space it takes up.

[1] https://github.com/chipsalliance/rocket-chip