
Looks like nice code. I have always liked ARM assembler for its cleanliness. But who does a lot of assembler programming? Those who do may well be horrified by the number of cycles these instructions need. So I am not sure that nice code is paramount to the success of an instruction set.


Compared to the alternative (up to 16 separate load instructions, with all of their associated fetch bandwidth), the number of cycles it takes looks really good. It's a way of amortizing the overhead of the instruction across all of the loads being issued.


Presuming your CPU 1. is pipelined, and 2. has a clock frequency higher than the memory bus speed, wouldn’t each memory fetch be guaranteed to take more than 1 CPU cycle (let’s say X cycles) to resolve? If the static CPU-side execute-phase overhead for each load instruction is M cycles, wouldn’t the total load time just be M+XN — because the static phase of each successive pipelined op is occurring while the previous ops are waiting around for RAM to get back to them?

Sort of like serially launching N async XHRs. The “serially” part doesn’t really matter, since the time of each task is dominated by the latency of the remote getting back to you; so you may as well think of it as launching them in parallel.

The only practical difference I can see is that LDM results in smaller code and so improves instruction-cache efficiency.
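As a rough sanity check, the M + XN model above can be written out numerically. The cycle counts below are made-up illustrative values, not measurements of any real core, and the model assumes only one outstanding memory request at a time:

```python
def total_cycles(n_loads, m_issue, x_mem):
    # Every load waits x_mem cycles for memory; each load's m_issue
    # cycles of issue overhead (except the first load's) hide behind
    # the previous load's memory wait, so only one M is paid up front.
    return m_issue + x_mem * n_loads

# 16 loads, 1 cycle of issue overhead, 4 cycles of memory latency:
print(total_cycles(16, 1, 4))  # 65, vs. 16 * (1 + 4) = 80 unpipelined
```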


LDM is from a world where single-cycle memory access without caches was the norm. Whether that's classic systems like the ARM1, when DRAM was actually faster than CPU clocks (you could even run multiple CPUs out of phase with each other and have them all hit the same DRAM), or embedded systems like the Game Boy Advance or Cortex-M cores, where main memory can (but doesn't have to) be single-cycle-access SRAM or NOR flash.


At the time of the ARM1 and ARM2, DRAM typically had a 120ns cycle time (though you could get 100ns DRAM). The Archimedes machines ran at 8MHz, so they ran at the limit of the DRAM bandwidth. There were some early prototypes which could run at 12MHz, but that required fast DRAM and careful component selection or the computer would not run reliably - look for “fast A500” at http://chrisacorns.computinghistory.org.uk/Computers/A500.ht...

Other CPUs were not as good as the ARM at using the available DRAM bandwidth.


> up to 16 load instructions separately

That's not the only alternative. Here's AVX2 code that copies 32 bytes with 2 instructions:

    vmovdqu ymm0, ymmword ptr[rsi]
    vmovdqu ymmword ptr[rdi], ymm0
 
Pretty sure ARM NEON can do something similar.
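For example, on AArch64 (assuming the SIMD registers are available), a load/store pair of 128-bit q registers also copies 32 bytes in 2 instructions; the register and address choices here are just illustrative:

    ldp q0, q1, [x0]    // load 32 bytes into two 128-bit registers
    stp q0, q1, [x1]    // store them out again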


The point isn't the number of bytes transferred, but being able to dump the integer register file to/from main memory quickly (and issue an indirect jump at the same time!)

Also, the low-gate-count cores where these instructions really shine don't tend to have vector units.
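For illustration, the classic AArch32 idiom is a one-instruction prologue/epilogue; the register list here is arbitrary, and popping into pc is the "indirect jump at the same time":

    push    {r4-r11, lr}    @ STMDB sp!, {...}: spill 9 registers in one instruction
    ...                     @ function body
    pop     {r4-r11, pc}    @ LDMIA sp!, {...}: restore and return in one go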


But NEON is optional, and not available in many "truly" embedded implementations.


Not optional in AArch64 though.


So is AVX2.



