
Looks like nice code. I have always liked ARM assembler for its cleanliness. But who does a lot of assembler programming? Those who do may well be horrified by the number of cycles these instructions need. So I am not sure that nice code is paramount to the success of an instruction set.


Compared to the alternative (up to 16 separate load instructions, with all of their associated fetch bandwidth), the number of cycles it takes looks really good. It's a way of amortizing the overhead of the instruction across all of the loads being issued.


Presuming your CPU 1. is pipelined, and 2. has a clock frequency higher than the memory bus speed, wouldn’t each memory fetch be guaranteed to take more than 1 CPU cycle (let’s say X cycles) to resolve? If the static CPU-side execute-phase overhead for each load instruction is M cycles, wouldn’t the total load time just be M+XN — because the static phase of each successive pipelined op is occurring while the previous ops are waiting around for RAM to get back to them?

Sort of like serially launching N async XHRs. The “serially” part doesn’t really matter, since the time of each task is dominated by the latency of the remote getting back to you; so you may as well think of it as launching them in parallel.

The only practical difference I can see is that LDM results in smaller code and so improves instruction-cache efficiency.
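As a rough sanity check, the M + XN model above can be written out numerically. The cycle counts below are made-up illustrative values, not measurements of any real core, and the model assumes only one outstanding memory request at a time:

```python
def total_cycles(n_loads, m_issue, x_mem):
    # Every load waits x_mem cycles for memory; each load's m_issue
    # cycles of issue overhead (except the first load's) hide behind
    # the previous load's memory wait, so only one M is paid up front.
    return m_issue + x_mem * n_loads

# 16 loads, 1 cycle of issue overhead, 4 cycles of memory latency:
print(total_cycles(16, 1, 4))  # 65, vs. 16 * (1 + 4) = 80 unpipelined
```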


LDM is from a world where single-cycle memory access without caches was the norm. Whether that's classic systems like the ARM1, when DRAM was actually faster than CPU clocks (you could even run multiple CPUs out of phase with each other and have them all hit the same DRAM), or embedded systems like the Game Boy Advance or Cortex-M cores, where main memory can (but doesn't have to) be single-cycle-access SRAM or NOR flash.


At the time of the ARM1 and ARM2, DRAM typically had a 120ns cycle time (though you could get 100ns DRAM). The Archimedes machines ran at 8MHz, so they ran at the limit of the DRAM bandwidth. There were some early prototypes which could run at 12MHz, but that required fast DRAM and careful component selection or the computer would not run reliably - look for “fast A500” at http://chrisacorns.computinghistory.org.uk/Computers/A500.ht...

Other CPUs were not as good as the ARM at using the available DRAM bandwidth.


> up to 16 load instructions separately

That's not the only alternative. Here's AVX2 code that copies 32 bytes with 2 instructions:

    vmovdqu ymm0, ymmword ptr[rsi]
    vmovdqu ymmword ptr[rdi], ymm0
 
Pretty sure ARM NEON can do something similar.
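For example, on AArch64 (assuming the SIMD registers are available), a load/store pair of 128-bit q registers also copies 32 bytes in 2 instructions; the register and address choices here are just illustrative:

    ldp q0, q1, [x0]    // load 32 bytes into two 128-bit registers
    stp q0, q1, [x1]    // store them out again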


The point isn't the number of bytes transferred, but being able to dump the integer register file to/from main memory quickly (and issue an indirect jump at the same time!)

Also, the low-gate-count cores where these instructions really shine don't tend to have vector units.
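For illustration, the classic AArch32 idiom is a one-instruction prologue/epilogue; the register list here is arbitrary, and popping into pc is the "indirect jump at the same time":

    push    {r4-r11, lr}    @ STMDB sp!, {...}: spill 9 registers in one instruction
    ...                     @ function body
    pop     {r4-r11, pc}    @ LDMIA sp!, {...}: restore and return in one go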


But NEON is optional, and not available in many "truly" embedded implementations.


Not optional in AArch64 though.


So is AVX2.



