
Presuming your CPU (1) is pipelined, and (2) has a clock frequency higher than the memory bus speed, wouldn’t each memory fetch be guaranteed to take more than one CPU cycle (let’s say X cycles) to resolve? If the static CPU-side execute-phase overhead for each load instruction is M cycles, wouldn’t the total load time for N loads just be M + X·N, since the static phase of each successive pipelined op happens while the previous ops are waiting for RAM to get back to them?
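Put numerically (just a sketch of the model in the paragraph above, with made-up values for M, X and N, and assuming the memory accesses themselves are serialized while each load's CPU-side work hides under the previous load's wait):

    # Cycle-count model: M cycles of CPU-side work per load, X cycles of
    # memory latency, N loads.
    def naive_total(m: int, x: int, n: int) -> int:
        # No overlap at all: every load pays both costs in full.
        return (m + x) * n

    def pipelined_total(m: int, x: int, n: int) -> int:
        # Only the first load's M is visible; every later load's M is hidden
        # under the previous load's memory wait (assuming X >= M), so the
        # total collapses to M + X*N.
        return m + x * n

    M, X, N = 1, 10, 8  # made-up numbers purely for illustration
    print(naive_total(M, X, N))      # 88
    print(pipelined_total(M, X, N))  # 81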

Sort of like serially launching N async XHRs. The “serially” part doesn’t really matter, since the time of each task is dominated by the latency of the remote getting back to you; so you may as well think of it as launching them in parallel.
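To make the analogy concrete, here's a minimal sketch (asyncio.sleep stands in for the XHR round trip; the 0.1s latency and the count of 8 are made up): the requests are launched one after another, but the wall time is roughly one latency, not N of them.

    import asyncio
    import time

    async def fake_request(latency_s: float) -> None:
        # Stand-in for an XHR: all that matters here is the round-trip latency.
        await asyncio.sleep(latency_s)

    async def main() -> None:
        start = time.perf_counter()
        # "Serially" launch 8 requests...
        tasks = [asyncio.create_task(fake_request(0.1)) for _ in range(8)]
        # ...but they are all in flight at once, so this takes ~0.1s, not ~0.8s.
        await asyncio.gather(*tasks)
        print(f"wall time: {time.perf_counter() - start:.2f}s")

    asyncio.run(main())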

The only practical difference I can see is that LDM results in smaller code and so improves instruction-cache utilization.



LDM is from a world where single-cycle memory access without caches is a thing. Whether that's classic systems like the ARM1, when DRAM was actually faster than CPU clocks (you could even run multiple CPUs out of phase with each other and just have them all hit the same DRAM), or embedded systems like the Game Boy Advance or Cortex-M cores, where main memory can (but doesn't have to) be single-cycle-access SRAM or NOR flash.


At the time of the ARM1 and ARM2, DRAM typically had a 120ns cycle time (though you could get 100ns DRAM). The Archimedes machines ran at 8MHz, so they ran at the limit of the DRAM bandwidth. There were some early prototypes which could run at 12MHz, but that required fast DRAM and careful component selection or the computer would not run reliably; look for “fast A500” at http://chrisacorns.computinghistory.org.uk/Computers/A500.ht...
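Just to spell out the arithmetic from the figures above (nothing beyond what's quoted there): a 120ns cycle time works out to roughly 8.3 million accesses per second, which is about one access per 125ns clock of an 8MHz CPU.

    # Back-of-the-envelope from the numbers quoted above.
    dram_cycle_ns = 120
    cpu_clock_mhz = 8

    dram_accesses_per_sec = 1e9 / dram_cycle_ns   # ~8.3 million/s
    cpu_cycle_ns = 1e3 / cpu_clock_mhz            # 125 ns per clock at 8 MHz

    print(f"DRAM: ~{dram_accesses_per_sec / 1e6:.1f}M accesses/s")
    print(f"CPU cycle at {cpu_clock_mhz} MHz: {cpu_cycle_ns:.0f} ns "
          f"(DRAM cycle: {dram_cycle_ns} ns)")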

Other CPUs were not as good as the ARM at using the available DRAM bandwidth.



