Well, UUID generation isn’t going to be quite as SIMDable as counting, so the analogy partially breaks down there. And is += 1 really a very SIMDable operation? I guess you could keep a vector of offsets +0, +1, +2, +3, add that to your base number, and then step every lane by 4 each iteration. With 64-bit integers that’s 4 lanes per register on AVX2 (256-bit) and 8 on AVX-512; only 128-bit SSE is limited to 2.
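To make the offset-vector idea concrete, here is a rough sketch in Rust. The function names `fill_sequence` and `fill_sequence_avx2` are made up for illustration, and I'm assuming x86_64 with runtime AVX2 detection plus a scalar fallback everywhere else. It only shows the lane layout (four 64-bit lanes in a 256-bit register); it says nothing about how fast this is on any given machine.

```rust
/// AVX2 path: writes base, base+1, base+2, ... four 64-bit lanes per store.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn fill_sequence_avx2(base: u64, out: &mut [u64]) {
    use std::arch::x86_64::*;

    // _mm256_set_epi64x lists lanes from highest to lowest, so lane 0 gets +0.
    let offsets = _mm256_set_epi64x(3, 2, 1, 0);
    let mut lanes = _mm256_add_epi64(_mm256_set1_epi64x(base as i64), offsets);
    // Each iteration advances all four lanes by the lane count.
    let step = _mm256_set1_epi64x(4);

    let full = out.len() / 4 * 4;
    let mut i = 0;
    while i < full {
        // Unaligned store of four consecutive counter values.
        unsafe { _mm256_storeu_si256(out.as_mut_ptr().add(i) as *mut __m256i, lanes) };
        lanes = _mm256_add_epi64(lanes, step);
        i += 4;
    }
    // Scalar tail for lengths that are not a multiple of four.
    for j in full..out.len() {
        out[j] = base.wrapping_add(j as u64);
    }
}

/// Hypothetical wrapper: uses AVX2 when the CPU reports it, otherwise a plain loop.
pub fn fill_sequence(base: u64, out: &mut [u64]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: we just checked that the CPU supports AVX2.
            unsafe { fill_sequence_avx2(base, out) };
            return;
        }
    }
    for (j, slot) in out.iter_mut().enumerate() {
        *slot = base.wrapping_add(j as u64);
    }
}
```

Calling `fill_sequence(100, &mut buf)` on a 10-element buffer does two vector stores and then finishes the last two slots in the scalar tail.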
Then your 32 HT threads aren’t really going to give you full access to the underlying SIMD execution units, which are shared per physical core. I assume that’s where you realized the 2x difference might show up?
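And to do += 1 multithreaded you have to partition the range or you won’t get any speedup. If every increment is an atomic operation on one shared counter, the threads just contend with each other; unless you amortize the cost of atomic synchronization over many increments per thread, you’re going to be slower than a single-threaded non-SIMD increment.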
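Here's a minimal sketch of the difference, again in Rust. The names `shared_atomic_count`, `partitioned_count`, and `chunk_len` are made up for this example, and `workers` is assumed to be greater than zero. The first version pays for an atomic read-modify-write on every increment; the second gives each thread its own slice of the range and only synchronizes once, at join time. I haven't benchmarked this, so treat it as an illustration of the structure and not as a measurement.

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Size of worker `index`'s share when `total` is split across `workers`.
fn chunk_len(total: u64, workers: u64, index: u64) -> u64 {
    assert!(workers > 0, "need at least one worker");
    // The first `total % workers` workers take one extra item.
    total / workers + u64::from(index < total % workers)
}

/// Every increment is an atomic operation on one shared counter.
fn shared_atomic_count(total: u64, workers: u64) -> u64 {
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for i in 0..workers {
            let counter = &counter;
            let len = chunk_len(total, workers, i);
            s.spawn(move || {
                for _ in 0..len {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    counter.load(Ordering::Relaxed)
}

/// Each worker counts privately; results are combined once when joining.
fn partitioned_count(total: u64, workers: u64) -> u64 {
    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|i| {
                let len = chunk_len(total, workers, i);
                s.spawn(move || {
                    let mut local = 0u64;
                    for _ in 0..len {
                        // black_box keeps the compiler from folding the loop away.
                        local = black_box(local + 1);
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<u64>()
    })
}
```

Both functions return the same total; the only thing that changes is how often the threads have to touch shared state.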
So assuming you use 64-bit counters, you can divide those 12 years by 1024 to get about 4 days (12 × 365 / 1024 ≈ 4.3).
And that's not even considering what you could do on a GPU.
Edit: I might be off by a factor of 2, not sure if the SIMD throughput is per-core or per-thread. Also thermal throttling. Same ballpark though!