The speedup figures they report are measured against their own CUTLASS-based baseline. Has anyone done a performance comparison against cuBLAS?
All the CUTLASS GEMM results I have seen so far are within ~10% of cuBLAS. If the 2x-2.5x speedup they report holds up, that would be extremely impressive.
I generally avoid FP8 and prefer I8, but your question got me wondering how well cuBLAS performs.
First of all, FP8 in cuBLAS is only available through cuBLASLt, the extension API for mixed-precision workloads. Second, some seemingly reasonable type combinations, like E5M2 x E5M2 for A x B, are not supported, while others, like E5M2 x E4M3, are! Moreover, matrix A must always come in a transposed layout on Ada, Hopper, and Blackwell... and the list of constraints goes on.
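To make those rules concrete, here is a rough sketch that encodes them as a compile-time check. All names (`Fp8`, `Layout`, `fp8_gemm_supported`) are hypothetical and are not cuBLASLt identifiers. The treatment of the E4M3 x E4M3 and E4M3 x E5M2 pairs, and the requirement that B stay non-transposed, are assumptions that should be checked against the cuBLASLt documentation.

```cpp
#include <cstdint>

// Hypothetical names for illustration; these are NOT cuBLASLt identifiers.
enum class Fp8 : std::uint8_t { E4M3, E5M2 };
enum class Layout : std::uint8_t { Transposed, NonTransposed };

// Type rule from the comment: E5M2 x E5M2 is rejected, E5M2 x E4M3 is accepted.
// Assumption: the remaining pairs (E4M3 x E4M3, E4M3 x E5M2) are accepted too.
constexpr bool fp8_pair_supported(Fp8 a, Fp8 b) {
    return !(a == Fp8::E5M2 && b == Fp8::E5M2);
}

// Layout rule from the comment: A must be transposed.
// Assumption: B must be non-transposed (the "TN" arrangement).
constexpr bool fp8_layout_supported(Layout a, Layout b) {
    return a == Layout::Transposed && b == Layout::NonTransposed;
}

// A problem passes only if both the type rule and the layout rule hold.
constexpr bool fp8_gemm_supported(Fp8 a, Fp8 b, Layout la, Layout lb) {
    return fp8_pair_supported(a, b) && fp8_layout_supported(la, lb);
}

static_assert(fp8_gemm_supported(Fp8::E5M2, Fp8::E4M3,
                                 Layout::Transposed, Layout::NonTransposed),
              "E5M2 x E4M3 in TN layout should pass");
static_assert(!fp8_gemm_supported(Fp8::E5M2, Fp8::E5M2,
                                  Layout::Transposed, Layout::NonTransposed),
              "E5M2 x E5M2 should be rejected");
```

A check like this only mirrors the constraints listed here; the real API has further alignment and scaling requirements that it does not model.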
I've integrated FP8 cuBLASLt benchmarks into my "Less Slow C++" repository <https://github.com/ashvardanian/less_slow.cpp>, adding to the list of existing cuBLAS and hand-rolled CUDA and PTX benchmarks. I'm running them on H200 GPUs, which should have the same performance as H100. For square inputs, the throughput peaks around 1.35 Peta-ops.
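For context on how a throughput figure like that is usually derived, here is a minimal sketch. It assumes the common convention of counting 2 * N^3 scalar operations for a square N x N x N GEMM (one multiply and one add per inner-product term). The function names are made up for illustration and are not taken from the repository.

```cpp
#include <cstdint>

// A square N x N x N GEMM performs N^3 multiply-adds,
// which is conventionally counted as 2 * N^3 scalar operations.
constexpr double gemm_ops(std::uint64_t n) {
    return 2.0 * static_cast<double>(n) * static_cast<double>(n) *
           static_cast<double>(n);
}

// Throughput in peta-ops per second, given the measured wall time
// of a single GEMM call in seconds.
constexpr double gemm_peta_ops_per_second(std::uint64_t n, double seconds) {
    return gemm_ops(n) / seconds / 1e15;
}
```

Under this convention, a 16384 x 16384 x 16384 GEMM is about 8.8e12 operations, so sustaining 1.35 Peta-ops would mean finishing one such call in roughly 6.5 ms.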
I've heard that it is possible to beat cuBLAS with CUTLASS. I assumed they had picked the better of cuBLAS and CUTLASS as the baseline.