A 100× transistor count would amount to basically 6 to 7 doublings of Moore's la...
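A quick check of the doubling arithmetic in the quoted line (only the factor of 100 is taken from the text; no doubling period is assumed):

```latex
2^6 = 64 < 100 < 128 = 2^7, \qquad \log_2 100 = \frac{\ln 100}{\ln 2} \approx 6.64
```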

imtringued · on Dec 24, 2022

A 100x transistor count is a big enough difference that you would want some sort of ALU plus a branching/FSM/loop unit, arranged in an array, with a few FPGA elements on the inputs and outputs.
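A toy model of that kind of array (loosely what is usually called a coarse-grained reconfigurable array) might look like the sketch below. Everything here is hypothetical: the 16-bit datapath, the `AluTile`/`make_lut`/`run_array` names, and the choice of a 4-input LUT acting on the low nibble as the "FPGA element" at each edge. Only the loop part of the branching/FSM/loop unit is modeled, as a fixed iteration count.

```python
from dataclasses import dataclass
from typing import Callable, List

WORD_MASK = 0xFFFF  # assumption: 16-bit datapath

# Hypothetical word-level operations an ALU tile can be configured with.
ALU_OPS = {
    "add": lambda a, b: (a + b) & WORD_MASK,
    "sub": lambda a, b: (a - b) & WORD_MASK,
    "mul": lambda a, b: (a * b) & WORD_MASK,
}


@dataclass
class AluTile:
    op: str       # which ALU operation this tile performs
    operand: int  # constant second operand (assumption: one immediate per tile)

    def step(self, x: int) -> int:
        return ALU_OPS[self.op](x, self.operand)


def make_lut(table: List[int]) -> Callable[[int], int]:
    """A 4-input LUT on the low nibble: stands in for the 'few FPGA elements' at the edge."""
    assert len(table) == 16
    return lambda x: (x & ~0xF) | (table[x & 0xF] & 0xF)


def run_array(tiles: List[AluTile], lut_in, lut_out, x: int, iterations: int) -> int:
    """Loop unit: stream one value through the tile chain a fixed number of times."""
    x = lut_in(x)
    for _ in range(iterations):
        for tile in tiles:
            x = tile.step(x)
    return lut_out(x)


identity = make_lut(list(range(16)))
invert = make_lut([15 - i for i in range(16)])
tiles = [AluTile("mul", 3), AluTile("add", 1)]
print(run_array(tiles, identity, identity, 2, iterations=2))  # 22
print(run_array(tiles, invert, invert, 2, iterations=2))      # 118
```

The word-level work happens in the ALU tiles; the bit-level LUTs only touch the edges, which is the split the comment describes.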

It sounds to me like the real problem is still that the ideal programming model hasn't been invented yet.

What would it look like? Compile a C function into an FPGA pipeline? A dataflow language where you explicitly define processes and their implementation?
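The second option, a dataflow style where each process is declared explicitly and connected to its neighbours by channels, can be sketched in Python with threads and FIFO queues. This is only an illustration of the programming model under assumed names (`start_process`, `run_pipeline`, the `END` marker), not a hardware description.

```python
import queue
import threading
from typing import Callable, List

END = object()  # end-of-stream marker passed along every channel


def start_process(fn: Callable[[int], int], inbox: queue.Queue, outbox: queue.Queue) -> threading.Thread:
    """One explicitly defined process: read a token, apply fn, forward the result."""
    def run() -> None:
        while True:
            token = inbox.get()
            if token is END:
                outbox.put(END)  # propagate shutdown downstream
                return
            outbox.put(fn(token))

    thread = threading.Thread(target=run)
    thread.start()
    return thread


def run_pipeline(stages: List[Callable[[int], int]], inputs: List[int]) -> List[int]:
    # One FIFO channel between each pair of neighbouring processes.
    channels = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [start_process(fn, channels[i], channels[i + 1]) for i, fn in enumerate(stages)]
    for x in inputs:
        channels[0].put(x)
    channels[0].put(END)
    results = []
    while (token := channels[-1].get()) is not END:
        results.append(token)
    for thread in threads:
        thread.join()
    return results


print(run_pipeline([lambda x: x * 3, lambda x: x + 1], [1, 2, 3]))  # [4, 7, 10]
```

Because every process reads from exactly one FIFO and writes to exactly one FIFO, token order is preserved end to end; on hardware each process would become a pipeline stage instead of a thread.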

I mean, imagine you take a mathematical formula: how would it translate into a series of additions, subtractions and multiplications? You could write it down in Fortran and get a well-defined, ordered tree of operations. Do we just translate that tree into hardware? Say you have 100 instructions with no control flow: do you translate them into 100 ALUs? Or does it make sense to reuse ALUs, so that 100 instructions map to fewer than 100 ALUs?
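The trade-off between one ALU per operation and reused ALUs can be made concrete with a standard greedy list scheduler. The formula below is a hypothetical example chosen for illustration; each cycle the scheduler issues up to `num_alus` operations whose inputs are already computed.

```python
from typing import List, Set, Tuple

# One operation: (destination, opcode, source1, source2).
Op = Tuple[str, str, str, str]

# Hypothetical formula y = (a + b) * (c - d) + a * c, written out
# as an ordered list of operations.
FORMULA: List[Op] = [
    ("t1", "+", "a", "b"),
    ("t2", "-", "c", "d"),
    ("t3", "*", "t1", "t2"),
    ("t4", "*", "a", "c"),
    ("y", "+", "t3", "t4"),
]


def schedule(ops: List[Op], inputs: Set[str], num_alus: int) -> List[List[str]]:
    """Greedy list scheduling: each cycle, issue up to num_alus ops whose sources are ready."""
    ready = set(inputs)
    pending = list(ops)
    cycles: List[List[str]] = []
    while pending:
        issued = [op for op in pending if op[2] in ready and op[3] in ready][:num_alus]
        if not issued:
            raise ValueError("missing input or cyclic dependency")
        cycles.append([op[0] for op in issued])
        ready.update(op[0] for op in issued)  # results become visible next cycle
        pending = [op for op in pending if op not in issued]
    return cycles


inputs = {"a", "b", "c", "d"}
print(schedule(FORMULA, inputs, num_alus=5))  # [['t1', 't2', 't4'], ['t3'], ['y']]
print(schedule(FORMULA, inputs, num_alus=2))  # [['t1', 't2'], ['t3', 't4'], ['y']]
```

With five ALUs (one per operation) the formula still needs three cycles, because the dependency chain `t1 -> t3 -> y` is three operations long. Two reused ALUs reach the same three cycles, and a single ALU needs five, so how many ALUs are worth spending depends on how parallel the operation tree actually is.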

If we assume the above model, then there are specific requirements for the hardware.

What if I need more than one algorithm? Can I switch implementations fast enough? Can the hardware hold multiple programmed algorithms for the same ALUs?

Programmable ALUs sound awfully close to what a CPU does, but in theory you would just have a register file holding, say, 16 different ALU configurations, and each piece of data you send through the ALUs would be prefixed with a 4-bit opcode telling the ALU which configuration to use. We are getting further and further away from what an FPGA does and closer to how CPUs work, but we still have the concept of programming the hardware for specific algorithms.
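A minimal sketch of that scheme, assuming a token layout of `[opcode:4][a:16][b:16]` and hypothetical names (`ConfigurableAlu`, `load`, `execute`, `pack`). Loading a slot stands in for the slow, FPGA-like programming step; picking a slot per token is the cheap runtime switch between algorithms.

```python
from typing import Callable, List

OPCODE_BITS = 4
NUM_CONFIGS = 1 << OPCODE_BITS  # 16 configuration slots
DATA_BITS = 16                  # assumption: two 16-bit operands per token
DATA_MASK = (1 << DATA_BITS) - 1


class ConfigurableAlu:
    def __init__(self) -> None:
        # Configuration register file; unprogrammed slots output 0.
        self.configs: List[Callable[[int, int], int]] = [lambda a, b: 0] * NUM_CONFIGS

    def load(self, slot: int, fn: Callable[[int, int], int]) -> None:
        """Program a slot ahead of time (the slow, FPGA-like step)."""
        self.configs[slot] = fn

    def execute(self, token: int) -> int:
        """Decode [opcode:4][a:16][b:16] and run the selected configuration."""
        opcode = token >> (2 * DATA_BITS)
        a = (token >> DATA_BITS) & DATA_MASK
        b = token & DATA_MASK
        return self.configs[opcode](a, b) & DATA_MASK


def pack(opcode: int, a: int, b: int) -> int:
    assert 0 <= opcode < NUM_CONFIGS
    return (opcode << (2 * DATA_BITS)) | ((a & DATA_MASK) << DATA_BITS) | (b & DATA_MASK)


alu = ConfigurableAlu()
alu.load(0, lambda a, b: a + b)
alu.load(1, lambda a, b: a * b)
alu.load(2, lambda a, b: a - b)
# Switching "algorithm" per token is just a different opcode, no reprogramming.
print([alu.execute(pack(op, 3, 4)) for op in (0, 1, 2)])  # [7, 12, 65535]
```

The subtraction wraps to 65535 because results are truncated to the assumed 16-bit width. Note how close the decode step already is to a CPU's instruction decode, which is the point the comment is making.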

These are just random thoughts, but they make it clear that a pure FPGA is not what the accelerator market needs.