I've spent a lot of time writing Python ML code, and I think 99% (of runtime spent inside fast C/CUDA code) is much higher than most programs achieve. Offhand I'd say 80% is average and 90% is good (and usually comes only after a performance refactor). Also, one aspect people often overlook is that optimizing compilers can combine or vectorize operations in ways that are an order of magnitude faster than the same operations performed consecutively in, say, pure NumPy. In that scenario, even though each individual NumPy operation is very fast for what it is, a Python program that chains a bunch of them makes a separate pass over memory (and allocates a temporary array) at every step, so it ends up running much slower than a single fused loop would.
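To make the temporary-array point concrete, here's a minimal NumPy-only sketch. It doesn't actually fuse the loops the way a JIT compiler (e.g. Numba or JAX's XLA) would, but it does show the hidden allocations that a naive chain of NumPy ops performs, which a fusing compiler eliminates entirely:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1_000_000)
b = rng.random(1_000_000)
c = rng.random(1_000_000)

# Naive chaining: each operator is its own C loop that writes a
# full-size temporary array (a*b, then a*b+c), so the data crosses
# the memory bus several times.
naive = np.sqrt(a * b + c)

# Same math with one explicitly reused buffer: still three passes,
# but no hidden allocations. A fusing compiler goes further and
# emits a single loop computing sqrt(a[i]*b[i] + c[i]) directly.
buf = np.multiply(a, b)
np.add(buf, c, out=buf)
np.sqrt(buf, out=buf)

assert np.allclose(naive, buf)
```

Timing the two versions on large arrays (and comparing against a `@numba.njit`-decorated loop, if Numba is available) makes the gap visible; the fused single-loop version typically wins because it touches each element once.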