Much more importantly, they fixed the MMU support. The original 68000 lost some of the state required to recover from a page fault; the workaround was ugly and expensive: run two CPUs "time shifted" by one cycle and inject a recoverable interrupt on the second CPU. Apparently it was still cheaper than the alternatives at the time if you wanted a CPU with an MMU, a 32-bit ISA, and a 24-bit address bus. Must have been a wild time.
> run two CPUs "time shifted" by one cycle and inject a recoverable interrupt on the second CPU.
That's not quite how it was implemented.
Instead, the second 68000 was halted and disconnected from the bus until the first 68000 (the executor) triggered a fault. Then the first 68000 would be held in halt and disconnected from the bus, and the second 68000 (the fixer) would take over the bus to run the fault handler code.
After the fault had been handled, the first 68000 could be released from halt and it would resume execution of the instruction, with all state intact.
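The handoff above can be sketched as a toy loop in Python. All the names here (`run_with_fixer`, `page_table`, and so on) are mine, purely illustrative; the point is just that the faulted instruction is retried, not resumed from extracted state:

```python
from collections import deque

def run_with_fixer(instructions, page_table, fixer_handler):
    """Run `instructions` on the executor; on a page fault, freeze the
    executor (its mid-instruction state stays inside the chip), let the
    fixer own the bus to run the handler, then release the executor so
    it finishes the faulted instruction with all state intact."""
    queue = deque(instructions)
    faults = []
    while queue:
        page = queue[0]                      # peek: may retry after a fault
        if page not in page_table:           # address translation misses
            faults.append(page)              # executor is now held in halt
            fixer_handler(page_table, page)  # fixer runs the fault handler
            continue                         # executor released; retries
        queue.popleft()                      # instruction completed
    return faults

# Toy usage: page 3 is initially unmapped and gets "paged in" on fault.
mapped = {0, 1}
faults = run_with_fixer([0, 3, 1], mapped,
                        lambda table, page: table.add(page))
# faults == [3]; all three instructions eventually complete
```

Note that the model never inspects the executor's registers while it is halted; that inaccessibility is exactly the limitation discussed below.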
As for the cost of a second 68000, extra logic and larger PCBs? Well, the cost of the Motorola 68451 MMU (or equivalent) absolutely dwarfed the cost of everything else, so adding a second CPU really wasn't a big deal.
Technically it didn't need to be another 68000, any CPU would do. But it's simpler to use a single ISA.
While this executor + fixer setup does work for most use cases, it's still impossible to recover the state: the relevant state is simply held inside the halted 68000.
Which means, the only thing you can do is handle the fault and resume. If you need to page something in from disk, userspace is entirely blocked until the IO request completes. You can't go and run another process that isn't waiting for IO.
I suspect it also makes it impossible to correctly implement POSIX segfault signal handlers. If you try to run the handler on the executor, then the halted state is cleared and it's not valid to return from the signal handler anymore.
If you run the handler on the fixer instead, then you are running in a context without page faults, which would be disastrous if the segfault handler accesses code or data that has been paged out. And the segfault handler wouldn't have access to any of the executor CPU's state.
------
So there is merit to the idea of running two 68000s in lockstep. That would theoretically allow you to recover the full state.
But there is a problem: It's not enough to run the second 68000 one cycle behind.
You need to run it one instruction behind, putting all memory read data and wait-states into a FIFO for the second 68000 to consume. And 68000 instructions have variable execution time, so I guess the delay needs to be the length of the longest possible instruction (which is something like 60 cycles).
But what about pipelining? That's the whole reason you can't recover the state in the first place. I'm not sure, but it might be necessary to run a full 4 instructions behind, which would mean something like 240 cycles buffered in that FIFO.
This also means your fault handler is now running way too soon. You will need to emulate 240 cycles worth of instructions in software until you find the one which triggered the page fault.
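The FIFO scheme above can be modeled with a toy Python sketch. Everything here is hypothetical (a real design would buffer bus cycles and wait-states, not decoded values, and `DELAY` is the guessed pipeline depth from above), but it shows why the shadow CPU's state is recoverable and why it lags the fault:

```python
from collections import deque

DELAY = 4  # hypothetical depth: how many instructions the shadow lags

def lockstep(program, memory):
    """`program` is a list of (dst_reg, src_addr) register loads.
    Returns the shadow CPU's (recoverable) register state at the moment
    the primary faulted, the faulting instruction's index, and how many
    buffered instructions a handler would have to emulate forward."""
    fifo = deque()                 # read data the shadow will replay
    primary = {}                   # primary's register file
    shadow = {}                    # shadow's register file, DELAY behind
    for i, (reg, addr) in enumerate(program):
        if addr not in memory:     # primary hits a page fault
            return shadow, i, len(fifo)
        primary[reg] = memory[addr]
        fifo.append((reg, memory[addr]))
        if len(fifo) > DELAY:      # shadow consumes, staying DELAY back
            r, d = fifo.popleft()
            shadow[r] = d
    while fifo:                    # drain at end of program
        r, d = fifo.popleft()
        shadow[r] = d
    return shadow, None, 0

mem = {0: 10, 1: 11, 2: 12, 3: 13, 4: 14}   # address 5 is unmapped
prog = [("d0", 0), ("d1", 1), ("d2", 2), ("d3", 3), ("d4", 4), ("d5", 5)]
state, fault_at, behind = lockstep(prog, mem)
# fault_at == 5, and the shadow is 4 instructions behind: only d0 loaded
```

The `behind` count is the software-emulation burden: the handler must step the shadow's clean state forward through that many buffered instructions before it even reaches the one that faulted.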
I think such an approach is possible, but it really doesn't seem sane.
--------
I might need to do a deeper dive into this later, but I suspect all these early dual 68000 Unix workstations simply dealt with the issues of the executor/fixer setup and didn't implement proper segfault signal handlers. It's reasonably rare for programs to do anything in a segfault handler other than print a nice crash message.
Any unix program that did fancy things in segfault handlers wasn't portable anyway, as many unix systems didn't have paging at all. It was enough to have a memory mapper with a few segments (base, size, and physical offset).
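That kind of segment mapper is simple enough to sketch. The `Segment` type and field names below are illustrative, not taken from any real mapper; the point is that a miss is always a hard fault, since there's nothing to page in:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    base: int    # virtual start address
    size: int    # length of the segment
    phys: int    # physical address the segment's start maps to

def translate(segments, vaddr):
    """Return the physical address for `vaddr`, or raise on a miss."""
    for s in segments:
        if s.base <= vaddr < s.base + s.size:
            return s.phys + (vaddr - s.base)
    raise MemoryError(f"segmentation fault at {vaddr:#x}")

segs = [Segment(base=0x0000, size=0x4000, phys=0x20000),
        Segment(base=0x8000, size=0x2000, phys=0x38000)]
translate(segs, 0x0010)   # -> 0x20010
translate(segs, 0x8100)   # -> 0x38100
```

With no demand paging, none of the mid-instruction restart problems above even arise: a fault never needs to be resumed, only reported.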
That's neat. For small loop buffers, I quite like the GreenArrays forth core. It has 18 bit words that hold 4 instructions each, and one of the opcodes decrements a loop counter and goes back to the start of the word. And it can run appreciably faster while it's doing that.
The loop buffer on the 68010 was almost useless: not only was it just 6 bytes, it held only two instructions. One had to be the loop instruction (DBcc), so the loop body had to be a single instruction. Pretty much the only thing it could speed up in practice was an unoptimized memcpy.
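As I understand it, the 68010's loop mode engages only for a one-word instruction followed by a DBcc whose displacement is -4 (i.e. branching back to that single instruction). A tiny predicate makes the constraint concrete; the function name is mine:

```python
def enters_loop_mode(body_words, is_dbcc, displacement):
    """body_words: length of the loop-body instruction in 16-bit words.
    The buffer holds 3 words total: 1 for the body, 2 for the DBcc."""
    return body_words == 1 and is_dbcc and displacement == -4

# MOVE.B (A0)+,(A1)+ / DBRA D0,loop -- the classic memcpy inner loop:
enters_loop_mode(1, True, -4)    # -> True
# Any body with an immediate, displacement, or second operand word
# doesn't fit, so almost every real loop is excluded:
enters_loop_mode(2, True, -4)    # -> False
```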
Could anyone do any better on the 68000? My incomplete history of dedicated CPU fast paths for moving data:
- 1982 Intel 186/286 'rep movsw' at a theoretical 2 cycles per byte (I think it's closer to 4 in practice). Brilliant, then Intel drops the ball for 20 years :|
- 1986 WDC W65C816 Move Memory Negative (MVN), Move Memory Positive (MVP) at a hilarious 7 cycles per byte. Slower than unrolled code, 2x slower than unrolled code using zero page. Afaik no loop buffer means it's re-fetching the whole instruction every iteration.
- 1987 NEC TurboGrafx-16/PC Engine 6502 clone by HudsonSoft, the HuC6280: Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII) at a hysterical 6 cycles per byte plus 17 cycles of startup. (17 + 6x) = ~160KB/s at a 7.16 MHz CPU. For comparison, an IBM XT with a 4.77 MHz NEC V20 does >300KB/s.