This is not an area of expertise for me, so is there a reason to not offload vector processing to the GPU and devote the CPU silicon to what it's good at, which is scalar instructions?
There are many reasons. The latency of shuttling data back and forth to the GPU is a pretty high threshold to cross before you see any benefit, and many tasks are still CPU bound because they have data dependencies and branchy logic that benefit from good branch prediction and out-of-order execution.
Many high-compute tasks are CPU bound. GPUs are only good for large amounts of uniform math with little branching. It turns out that only applies to a small set of problems, so you have to put in a lot of effort to turn your problem into lots of dumb math instead of a little bit of smart math, and then justify the penalty for leaving L1.
Yes, communication overhead. SIMD instructions in the CPU have direct access to the same registers and caches as regular instructions. Moving data to a discrete GPU and back is a very expensive operation relative to that: the chips are physically farther apart and have to communicate over PCIe, mostly via memory copies.
Consider a typical use case for SIMD instructions: you just decrypted an image or a bit of audio downloaded over SSL and want to process it for rendering. The data is already sitting in the CPU caches. SIMD will munch through it right there, whereas shipping it to a GPU and back could cost more than the processing itself.