> single-threaded 16-bit integer code might get 0.125 operations per clock cycle on the C64
Merely adding two 16-bit integers on the 6510 takes 14 cycles, using fixed zero-page locations for the operands and destination. You'll easily spend 150 cycles on a general 16x16 multiplication using look-up tables. Not even counting the juggling of values into and out of those fixed zero-page locations via the three 8-bit registers or the stack, we're talking about something like a tenth of your estimated op/s for an even mix of additions and multiplications. So I'd say much closer to a million times faster for a use case like this than 96,000.
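The back-of-envelope math works out like this (all figures are the assumptions above, not measurements):

```python
# Rough check of the claim above, using the thread's assumed cycle counts.
estimated_cycles_per_op = 1 / 0.125              # 0.125 ops/cycle -> 8 cycles/op
add_cycles, mul_cycles = 14, 150                 # assumed 6510 costs from above
actual_cycles_per_op = (add_cycles + mul_cycles) / 2   # even mix -> 82 cycles/op
slowdown = actual_cycles_per_op / estimated_cycles_per_op
print(slowdown)            # 10.25, i.e. about a tenth of the estimated op/s
print(96_000 * slowdown)   # 984000.0, i.e. roughly a million, not 96,000
```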
There may be special cases where the 6510 achieves 0.125 16-bit operations per cycle, for example multiplying by a constant two and adding a constant one (10 and 6 cycles, respectively).
16-bit single-threaded integer code seems like a rather contrived example as well. After all, we're typically not running 16-bit applications over MS-DOS on our monster machines. Just booting my OS will result in all cores executing a bunch of 32/64-bit operations.
It would be interesting to see something like a modern cryptographic hashing algorithm implemented on a 6502 and compare performance both on long messages and on many smaller messages. This should give us an idea of how much slower a 6502 is at integer operations.
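As a taste of why that comparison would be lopsided: hash functions like SHA-256 lean heavily on 32-bit rotates, which an 8-bit CPU has to do one bit at a time across four bytes. A rough Python model of the 6502's ASL/ROL approach (function names and structure are mine, purely illustrative):

```python
def rotl32_reference(x: int, n: int) -> int:
    """Ordinary 32-bit left-rotate: effectively one instruction on a modern CPU."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def rotl32_bytewise(x: int, n: int) -> int:
    """The same rotate done the 8-bit way: shift left one bit at a time,
    propagating the carry through four bytes (an ASL plus three ROLs on a
    6502), then folding the final carry back into bit 0."""
    b = [(x >> (8 * i)) & 0xFF for i in range(4)]  # little-endian bytes
    for _ in range(n % 32):                        # one shift chain per bit
        carry = 0
        for i in range(4):
            v = (b[i] << 1) | carry
            carry, b[i] = v >> 8, v & 0xFF
        b[0] |= carry                              # wrap the carry around
    return sum(v << (8 * i) for i, v in enumerate(b))
```

Each bit of rotation costs four read-modify-write zero-page operations (5-6 cycles apiece on a 6502), so a single 32-bit rotate by 7 already runs to well over a hundred cycles.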
I very much appreciate the corrections, particularly from someone who knows the architecture so much better than I do. Most days I regret commenting on HN, but today is not one of those days.
To clarify, by "16-bit integer code" I meant code that doesn't use floating-point, in which most of the arithmetic is done on 16-bit values, not code for a 16-bit integer machine like the 8086 or code consisting entirely of 16-bit arithmetic operations. My reason for picking 16-bit was that most of my integers fit into 16 bits, but often not 8 bits. Usually arithmetic that needs more than 16 bits of precision is address arithmetic, and on the 6502 (or 6510) that's often handled by the 8-bit X and Y registers. Even multiplies are much less common than addition, which in turn is less common than MOV. And of course jumps, calls, returns, and 8-bit-index loops (inx; bne :-) suffer comparatively less slowdown than the actual 16-bit operations in the 16-bit integer code, and they usually constitute the majority of it.
I agree that cryptographic algorithms routinely do very wide arithmetic. They want as many ALU bits as they can get their little anarchist hands on. But I think they are atypical in this.
When I look at the applications running on the computers sitting around me, most of the things they're doing seem like they would fit well into the 16-bit integer arithmetic bucket, so I don't think it's contrived. The way they're doing those things (in JS, with JIT compilers, using floating point, dynamic typing, and garbage collection) is tailored to the bigger machines we usually use, but the applications (text editing, spreadsheets, networking, arguing with strangers who know more than I do, text layout, font rendering, previously futures trading) are not. The big exceptions here are 3-D graphics and video codecs, which want ALUs as wide as possible, just like crypto algorithms.
Also, modern Xeons can do 32 double-precision floating-point ops per cycle, per core. Since they have dozens of cores, that's another factor of ~1,000. Combined with your 10x overhead estimate for the C64, that brings the speedup to roughly a billion (96,000 * (10 to 150x) * 1,000 ~= 1-10 billion).
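Spelled out with the low-end figures (all three factors are rough assumptions from this thread, not benchmarks):

```python
# The multiplication chain above, low end of each range.
per_core_estimate = 96_000   # the original single-core speedup estimate
c64_overhead = 10            # low end of the 10-150x correction
simd_and_cores = 1_000       # ~32 DP FLOPs/cycle x ~32 cores, give or take
print(per_core_estimate * c64_overhead * simd_and_cores)  # 960000000, ~a billion
```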