LLMs on GPUs still have a lot of computational inefficiency and untapped parallelism. GPUs were designed for more diverse workloads with much smaller working sets, and LLM inference is ridiculously DRAM-bound: current GPUs have roughly 10×-200× more compute than their DRAM bandwidth can keep fed. Even without any improvement in transistors we could build more efficient hardware for LLMs.
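A minimal back-of-envelope roofline sketch of where that gap comes from, assuming H100-class specs (~989 TFLOP/s dense BF16, ~3.35 TB/s HBM); the numbers are illustrative, not from any benchmark:

```python
# Illustrative roofline arithmetic for batch-1 LLM decoding on an H100-class GPU.
peak_flops = 989e12   # FLOP/s, dense BF16 (assumed spec)
hbm_bw     = 3.35e12  # bytes/s of HBM bandwidth (assumed spec)

# Arithmetic intensity needed for the chip to be compute-bound rather than
# bandwidth-bound:
balance_point = peak_flops / hbm_bw   # ~295 FLOP per byte read from DRAM

# Batch-1 autoregressive decoding: every weight (2 bytes in BF16) is streamed
# from DRAM once per token and used in roughly 2 FLOPs (one multiply-add).
decode_intensity = 2 / 2              # ~1 FLOP per byte

print(f"balance point:    {balance_point:.0f} FLOP/byte")
print(f"decode intensity: {decode_intensity:.0f} FLOP/byte")
print(f"compute sits idle by ~{balance_point / decode_intensity:.0f}x "
      f"unless many requests are batched together")
```

With heavy batching the effective intensity rises and the gap shrinks toward the low end of that 10×-200× range, which is why serving economics depend so much on batch size.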
The way we use LLMs is also primitive and inefficient. RAG is a hack, and in most LLM architectures the RAM cost grows quadratically with the context length, in a workload that is already DRAM-bound, on hardware that already doesn't have enough RAM.
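For a sense of scale, a minimal sketch assuming Llama-2-7B-like dimensions (32 layers, 32 heads, head dim 128, BF16, batch 1; these are illustrative assumptions, not figures from the thread) of how naively materialized attention scores grow quadratically while even the linear KV cache soon outgrows on-card RAM:

```python
# Rough memory estimates for context-length growth; all numbers illustrative.
LAYERS, HEADS, HEAD_DIM, BYTES = 32, 32, 128, 2  # assumed 7B-class model, BF16

def kv_cache_bytes(ctx: int) -> int:
    # K and V per layer: ctx * heads * head_dim values each -> linear in ctx
    return 2 * LAYERS * ctx * HEADS * HEAD_DIM * BYTES

def naive_scores_bytes(ctx: int) -> int:
    # Materialized ctx x ctx attention-score matrix per head for one layer
    # -> quadratic in ctx (this is what fused attention kernels avoid storing)
    return HEADS * ctx * ctx * BYTES

for ctx in (4_096, 32_768, 131_072):
    print(f"ctx={ctx:>7}: KV cache {kv_cache_bytes(ctx)/2**30:7.1f} GiB, "
          f"naive per-layer scores {naive_scores_bytes(ctx)/2**30:8.1f} GiB")
```

At 128k context even the linear KV cache alone is tens of GiB, which is why long-context inference hits the RAM wall long before it hits the compute wall.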
> Depending on US tariffs […] end of fossil fuels […] global supply chain
It does look pretty bleak for the US.
OTOH China is rolling out more than a gigawatt of renewables a day, has the largest and fastest growing HVDC grid, a dominant position in battery and solar production, and all the supply chains. With the US going back to mercantilism and isolationism, China is going to have Taiwan too.