The speedup figures they report are compared to their own cutlass-based baseline...

ashvardanian · 2025-02-27T12:55:29 1740660929

I generally avoid FP8 and prefer I8, but your question got me wondering how well cuBLAS performs.

First of all, cuBLAS needs the cuBLASLt extension API for mixed-precision workloads to handle FP8. Second, some reasonable type combinations, like E5M2 x E5M2 for A x B, are not supported, while others, like E5M2 x E4M3, are! Moreover, matrix A must always come in a transposed layout on Ada, Hopper, and Blackwell... and the list of constraints goes on.
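
To make those two constraints concrete, here is a minimal sketch of a pre-flight check in plain C++. The enum and function names are hypothetical and are not cuBLASLt identifiers. It encodes only what is stated above (E5M2 x E5M2 rejected, A transposed), and it assumes every other FP8 pairing is accepted, which you should confirm against the cuBLASLt documentation.

```cpp
// Hypothetical names for illustration only; none of these come from cuBLASLt.
enum class Fp8 { E4M3, E5M2 };
enum class Layout { Normal, Transposed };

// Returns true if the inputs pass the two constraints described above.
constexpr bool fp8_gemm_inputs_ok(Fp8 a, Fp8 b, Layout a_layout) {
    // E5M2 x E5M2 is the unsupported pair; assumption: all other pairs are fine.
    const bool types_ok = !(a == Fp8::E5M2 && b == Fp8::E5M2);
    // Matrix A is expected in a transposed layout.
    const bool layout_ok = (a_layout == Layout::Transposed);
    return types_ok && layout_ok;
}
```

Calling `fp8_gemm_inputs_ok(Fp8::E5M2, Fp8::E4M3, Layout::Transposed)` passes, while swapping the second argument to `Fp8::E5M2` fails. The real list of constraints is longer, so treat this as a reminder and not a validator.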

I've integrated FP8 cuBLASLt benchmarks into my "Less Slow C++" repository <https://github.com/ashvardanian/less_slow.cpp>, adding to the list of existing cuBLAS and hand-rolled CUDA and PTX benchmarks. I'm running them on H200 GPUs, which should have the same compute throughput as the H100. For square inputs, the throughput peaks around 1.34 Peta-ops per second.

  --------------------------------------------------------------------------------------------------
  Benchmark                                        Time             CPU   Iterations UserCounters...
  --------------------------------------------------------------------------------------------------
  cublaslt_tops<fp8_e4m3_t, float>/256         12496 ns        12496 ns        56284 TOP=2.67999T/s
  cublaslt_tops<fp8_e4m3_t, float>/512         13089 ns        13089 ns        53100 TOP=20.4883T/s
  cublaslt_tops<fp8_e4m3_t, float>/1024        14882 ns        14882 ns        46918 TOP=144.23T/s
  cublaslt_tops<fp8_e4m3_t, float>/2048        25802 ns        25802 ns        26869 TOP=665.679T/s
  cublaslt_tops<fp8_e4m3_t, float>/4096       109316 ns       109313 ns         6021 TOP=1.25715P/s
  cublaslt_tops<fp8_e4m3_t, float>/8192       821080 ns       821050 ns          629 TOP=1.33907P/s
  cublaslt_tops<fp8_e4m3_t, float>/16384     7135472 ns      7135461 ns           93 TOP=1.23269P/s
  cublaslt_tops<fp8_e4m3_t, float>_BigO         0.00 N^3        0.00 N^3  
  cublaslt_tops<fp8_e4m3_t, float>_RMS             2 %             2 %

That's around 67% of the advertised number for dense GEMM <https://resources.nvidia.com/en-us-data-center-overview-mc/e...>.
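
For anyone who wants to reproduce the arithmetic, here is a minimal sketch. It assumes the benchmark counts 2 * N^3 operations per square GEMM (consistent with the numbers in the table) and that the advertised dense FP8 figure is roughly 1,979 TFLOPS, the H100 SXM datasheet value without sparsity. Both function names are made up for illustration.

```cpp
#include <cstdint>

// Operations per second for an N x N x N GEMM, assuming 2 * N^3 operations
// (one multiply and one add per inner-product term).
constexpr double gemm_ops_per_second(std::uint64_t n, double nanoseconds) {
    const double ops = 2.0 * static_cast<double>(n) * static_cast<double>(n) *
                       static_cast<double>(n);
    return ops / (nanoseconds * 1e-9);
}

// Achieved throughput as a fraction of an advertised peak, in the same units.
constexpr double fraction_of_peak(double achieved, double peak) {
    return achieved / peak;
}
```

Plugging in the 8192 row, `gemm_ops_per_second(8192, 821080.0)` comes out at roughly 1.34e15 ops/s, and dividing that by the assumed 1.979e15 peak gives a little over two thirds.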

Bimos · 2025-02-27T01:59:40 1740621580

I heard that it is possible to achieve better performance than cuBLAS using CUTLASS? I thought they chose the better of cuBLAS and CUTLASS as the baseline.