FYI you should have used llama.cpp to run the benchmarks. It's almost 20x faster than Ollama for the gpt-oss-120b model. Here are some sample results on my Spark:
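For anyone wanting to reproduce numbers like these, the measurement tool is llama.cpp's llama-bench; here's a minimal sketch of driving it from Python (that llama-bench is on PATH and the model filename are both assumptions):

```python
# Minimal sketch: run llama.cpp's llama-bench, which by default measures
# prompt-processing and token-generation throughput and prints a table.
# Assumes the llama-bench binary is on PATH; the model filename is
# hypothetical.
import subprocess

result = subprocess.run(
    ["llama-bench", "-m", "gpt-oss-120b-mxfp4.gguf"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```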
Is this the full-weight model or a quantized version? The GGUFs distributed on Hugging Face that are labeled as MXFP4 quantization have some layers quantized to int8 (q8_0) instead of the bf16 suggested by OpenAI.
For example, blk.0.attn_k.weight is q8_0, as are a number of other layers:
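If anyone wants to check their own copy, here's a minimal sketch using the gguf Python package that ships with llama.cpp (pip install gguf) to dump each tensor's quantization type; the model filename is hypothetical:

```python
# Minimal sketch, assuming the gguf Python package bundled with llama.cpp.
from gguf import GGUFReader

reader = GGUFReader("gpt-oss-120b-mxfp4.gguf")  # hypothetical filename
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType member (e.g. Q8_0, BF16),
    # so this prints the per-tensor quantization for the whole file.
    print(f"{tensor.name}: {tensor.tensor_type.name}")
```

The gguf-dump script installed by the same package should show the same per-tensor type information.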