We used CPU intrinsics in version 9 of GNU coreutils, for the cksum utility. From the release notes:
"cksum [-a crc] is now up to 4 times faster by using a slice by 8 algorithm, and at least 8 times faster where pclmul instructions are supported."
Implementing that portably is a bit tricky, as one must consider:
- supporting various compilers, some of which may not support intrinsics
- runtime checks to see if the current CPU supports the instructions
- ensuring compiler options enabling the instructions are restricted to their own lib, so they don't leak into unprotected code
- automake requires using a separate lib for this rather than just a separate compilation unit
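
To make the first two points concrete, here is a minimal sketch of the dispatch pattern, not the actual coreutils code. `USE_PCLMUL_SKETCH`, `crc_generic`, `crc_pclmul` and `select_crc` are hypothetical names I made up for illustration. The assumption is that the build system defines the macro only when the compiler handles the intrinsics, and that `crc_pclmul` lives in a separately built lib compiled with the extra instruction flags.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Portable fallback: bit-at-a-time CRC-32 update, polynomial 0x04C11DB7,
   most significant bit first.  Slow, but works with any compiler.  */
static uint32_t
crc_generic (uint32_t crc, const unsigned char *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
    {
      crc ^= (uint32_t) buf[i] << 24;
      for (int bit = 0; bit < 8; bit++)
        crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
  return crc;
}

/* Hypothetical macro: assumed to be set by configure only when the
   compiler supports the intrinsics (compile-time check).  */
#if defined USE_PCLMUL_SKETCH && defined __GNUC__ \
    && (defined __x86_64__ || defined __i386__)
# define HAVE_ACCEL 1
/* Hypothetical accelerated routine, assumed to be defined in its own lib
   so the instruction-enabling flags never apply to this file.  */
uint32_t crc_pclmul (uint32_t crc, const unsigned char *buf, size_t len);

/* Runtime check: does the CPU we are running on have the instructions?  */
static bool
have_pclmul (void)
{
  return __builtin_cpu_supports ("pclmul") > 0;
}
#else
# define HAVE_ACCEL 0
#endif

typedef uint32_t (*crc_fn) (uint32_t, const unsigned char *, size_t);

/* Pick the implementation once; callers then go through the pointer.  */
static crc_fn
select_crc (void)
{
#if HAVE_ACCEL
  if (have_pclmul ())
    return crc_pclmul;
#endif
  return crc_generic;
}
```

The key property is that the accelerated function is only ever reached through `select_crc`, so code built with the special flags never runs on a CPU that lacks the instructions.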
BTW we also introduced avx intrinsics for `wc -l`