So it's otherwise automatic, except that I just have to write a selector routine that decides at runtime which implementation will perform best, plus a separate implementation for each level of hardware support.
You only need to go to all that trouble if you want high performance across a variety of machines. If you are merely after bragging rights or trying to satisfy someone else's requirement, the theory is that you can compile the exact same piece of high-level code for different optimization targets, and the compiler will do all the work for you, providing maximum performance for each instruction set practically for free...
Even more practically, Agner has a typically excellent description of the strengths and weaknesses of the dispatch strategies used by different compilers in Section 13 (p. 122) here: http://www.agner.org/optimize/optimizing_cpp.pdf
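For what it's worth, here is roughly what the hand-written selector discussed above can look like. This is only a minimal sketch, assuming GCC or Clang on x86: `__builtin_cpu_supports` and the `target` attribute are compiler extensions, and the names `sum`, `sum_baseline`, `sum_avx2` and `select_sum` are made up for illustration. The same high-level loop is compiled twice, once for the default target and once with AVX2 allowed, and the choice is made once at runtime.

```cpp
#include <cassert>
#include <cstddef>

// Baseline version: compiled for the default target of the build.
static float sum_baseline(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
// Identical source, but the compiler is allowed to emit AVX2 instructions
// inside this one function (GCC/Clang extension, assumed available).
__attribute__((target("avx2")))
static float sum_avx2(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}
#define HAVE_AVX2_VARIANT 1
#endif

using SumFn = float (*)(const float*, std::size_t);

// The selector routine: picks an implementation based on what the CPU reports.
static SumFn select_sum() {
#ifdef HAVE_AVX2_VARIANT
    __builtin_cpu_init();  // harmless if the CPU model was already initialised
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;   // fallback for other CPUs and other compilers
}

// Public entry point: the selection runs once, on the first call.
float sum(const float* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    static const SumFn impl = select_sum();
    return impl(x, n);
}
```

Callers just use `sum(data, count)` and never see which version runs. Whether the AVX2 copy is actually faster depends on what the compiler manages to do with the loop, so it still has to be measured on each machine you care about.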