> single-threaded 16-bit integer code might get 0.125 operations per clock cycle on the C64
Merely adding two 16-bit integers on the 6510 takes 14 cycles, using fixed zero-page locations for the operands and destination. You'll easily spend 150 cycles on a general 16x16 multiplication using look-up tables. Not even counting the juggling of values into and out of those fixed zero-page locations via the three 8-bit registers or the stack, we're talking about something like a tenth of your estimated op/s for an even mix of additions and multiplications. So I'd say much closer to a million times faster for a use case like this than 96,000.
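The back-of-envelope math works out like this (all figures are the assumptions above, not measurements):

```python
# Rough check of the claim above, using the thread's assumed cycle counts.
estimated_cycles_per_op = 1 / 0.125              # 0.125 ops/cycle -> 8 cycles/op
add_cycles, mul_cycles = 14, 150                 # assumed 6510 costs from above
actual_cycles_per_op = (add_cycles + mul_cycles) / 2   # even mix -> 82 cycles/op
slowdown = actual_cycles_per_op / estimated_cycles_per_op
print(slowdown)            # 10.25, i.e. about a tenth of the estimated op/s
print(96_000 * slowdown)   # 984000.0, i.e. roughly a million, not 96,000
```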
There may be special cases where the 6510 achieves 0.125 16-bit operations per cycle, for example multiplying by a constant two and adding a constant one (10 and 6 cycles, respectively).
16-bit single-threaded integer code seems like a rather contrived example as well. After all, we're typically not running 16-bit applications over MS-DOS on our monster machines. Just booting my OS will result in all cores executing a bunch of 32/64-bit operations.
It would be interesting to see something like a modern cryptographic hashing algorithm implemented on a 6502 and compare performance both on long messages and on many smaller messages. This should give us an idea of how much slower a 6502 is at integer operations.
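As a taste of why that comparison would be lopsided: hash functions like SHA-256 lean heavily on 32-bit rotates, which an 8-bit CPU has to do one bit at a time across four bytes. A rough Python model of the 6502's ASL/ROL approach (function names and structure are mine, purely illustrative):

```python
def rotl32_reference(x: int, n: int) -> int:
    """Ordinary 32-bit left-rotate: effectively one instruction on a modern CPU."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def rotl32_bytewise(x: int, n: int) -> int:
    """The same rotate done the 8-bit way: shift left one bit at a time,
    propagating the carry through four bytes (an ASL plus three ROLs on a
    6502), then folding the final carry back into bit 0."""
    b = [(x >> (8 * i)) & 0xFF for i in range(4)]  # little-endian bytes
    for _ in range(n % 32):                        # one shift chain per bit
        carry = 0
        for i in range(4):
            v = (b[i] << 1) | carry
            carry, b[i] = v >> 8, v & 0xFF
        b[0] |= carry                              # wrap the carry around
    return sum(v << (8 * i) for i, v in enumerate(b))
```

Each bit of rotation costs four read-modify-write zero-page operations (5-6 cycles apiece on a 6502), so a single 32-bit rotate by 7 already runs to well over a hundred cycles.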
I very much appreciate the corrections, particularly from someone who knows the architecture so much better than I do. Most days I regret commenting on HN, but today is not one of those days.
To clarify, by "16-bit integer code" I meant code that doesn't use floating-point, in which most of the arithmetic is done on 16-bit values, not code for a 16-bit integer machine like the 8086 or code consisting entirely of 16-bit arithmetic operations. My reason for picking 16-bit was that most of my integers fit into 16 bits, but often not 8 bits. Usually arithmetic that needs more than 16 bits of precision is address arithmetic, and on the 6502 (or 6510) that's often handled by the 8-bit X and Y registers. Even multiplies are much less common than addition, which in turn is less common than MOV. And of course jumps, calls, returns, and 8-bit-index loops (inx; bne :-) suffer comparatively less slowdown than the actual 16-bit operations in the 16-bit integer code, and they usually constitute the majority of it.
I agree that cryptographic algorithms routinely do very wide arithmetic. They want as many ALU bits as they can get their little anarchist hands on. But I think they are atypical in this.
When I look at the applications running on the computers sitting around me, most of the things they're doing seem like they would fit well into the 16-bit integer arithmetic bucket, so I don't think it's contrived. The way they're doing those things (in JS, with JIT compilers, using floating point, dynamic typing, and garbage collection) is tailored to the bigger machines we usually use, but the applications (text editing, spreadsheets, networking, arguing with strangers who know more than I do, text layout, font rendering, previously futures trading) are not. The big exceptions here are 3-D graphics and video codecs, which want ALUs as wide as possible, just like crypto algorithms.
Also, modern Xeons can do 32 double-precision floating-point ops per cycle, per core. Since they have dozens of cores, that's another factor of ~1,000. Combined with your 10x overhead estimate for the C64, that brings the speedup to roughly a billion (96,000 * (10 to 150x) * 1,000 ~= 1-10 billion).
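Spelled out with the low-end figures (all three factors are rough assumptions from this thread, not benchmarks):

```python
# The multiplication chain above, low end of each range.
per_core_estimate = 96_000   # the original single-core speedup estimate
c64_overhead = 10            # low end of the 10-150x correction
simd_and_cores = 1_000       # ~32 DP FLOPs/cycle x ~32 cores, give or take
print(per_core_estimate * c64_overhead * simd_and_cores)  # 960000000, ~a billion
```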