Using languages with lower-complexity parallelism, optimising existing code as much as possible, and (possibly) writing code for FPGAs (modern processors are pretty good at doing lots of things reasonably well, but workload-tuned FPGAs kill them, and 10 years of FPGA research backed by a boatload of cash could be interesting).
Also if we are bottoming out on Silicon then investment into other technologies should pick up, Graphene and the like.
It should be exciting, Intel has had a lock on the desktop processor market for over 30 years (by volume if not technical excellence at some points).
There is also a lot of space between general-purpose CPUs and completely blank-slate FPGAs. GPUs are essentially very wide data-word processors (one instruction applied to a large data vector).
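A minimal sketch of that "one instruction, large data vector" idea, using a plain Python list as a stand-in for a wide hardware data word (the function name is made up for illustration):

```python
# Sketch of the SIMD idea behind GPUs: a single "instruction" (here, add)
# is applied across every lane of a wide data vector at once, instead of
# being issued once per element as on a scalar CPU.

def simd_add(a, b):
    """Apply one add 'instruction' lane-wise across two data vectors."""
    assert len(a) == len(b), "SIMD lanes must line up"
    return [x + y for x, y in zip(a, b)]

# Scalar view: four separate adds. SIMD view: one instruction, a 4-lane word.
print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```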
There are also configurable-pipeline processors that consist of multiple ALUs (which can operate on vector data) with reconfigurable connections between them. Rather than eat the overhead of the generic bit-level LUTs in an FPGA, you reconfigure the interconnect in a fabric of ALUs. The fabric can contain specialized ALUs (or execution units) at varying densities depending on typical usage. This avoids the technology mapping and place & route of FPGA design: translating a description of hardware (HDL) into actual lookup-table data is a massive compute problem for large modern FPGAs. However, if we collapse the routing to just data buses between compute units, the problem can be solved in real time. This way, a compute pipeline could be reorganized by an application at run time without using a full-fledged hardware description language with all its very low-level constructs. Higher-level language compilers for this kind of architecture already exist in academia.
EDIT: The sea of ALUs can also contain memories, FIFOs, and other block elements. However, they would all operate on the same word size, to simplify the routing problem and allow a maximum-density implementation.
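To make the "reconfigure the interconnect, not the logic" idea concrete, here is a hypothetical sketch of a coarse-grained fabric: the ALUs are fixed word-width operators, and the "configuration" is just a routing table saying which output feeds which input. All names are invented; real coarse-grained reconfigurable arrays are of course far more involved.

```python
# Toy model of a coarse-grained reconfigurable fabric. Each stage is a fixed
# word-width ALU; the only thing that changes at run time is the routing:
# which earlier result (or external input) feeds each ALU operand.

import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

class Fabric:
    def __init__(self):
        self.pipeline = []  # list of (op_name, operand_a_source, operand_b_source)

    def configure(self, stages):
        """Rewire the fabric. A source is 'in_a'/'in_b' for the external
        inputs, or the integer index of an earlier stage's result."""
        self.pipeline = stages

    def run(self, a, b):
        results = []
        for op, src_a, src_b in self.pipeline:
            x = a if src_a == "in_a" else results[src_a]
            y = b if src_b == "in_b" else results[src_b]
            results.append(OPS[op](x, y))
        return results[-1]

fab = Fabric()
# Route the ALUs to compute (a + b) * (a - b) -- no technology mapping or
# place & route, just a new routing table.
fab.configure([("add", "in_a", "in_b"),
               ("sub", "in_a", "in_b"),
               ("mul", 0, 1)])
print(fab.run(5, 3))  # (5+3) * (5-3) = 16
```

Because "compiling" a new pipeline is just writing this table, an application could plausibly reshape its datapath on the fly, which is the real-time claim above.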
In fact, this is the essence of how micro-coded instructions in a modern CPU work anyway: instruction scheduling is basically figuring out how to route data transfers through a sea of execution units.
The article "Fundamental Underpinnings of Reconfigurable Computing Architectures" in the March 2015 Proceedings of the IEEE contains a wonderful introduction to all these concepts.