Was hoping it was so easy :) But I probably need to look into it some more. llam...

adrian_b · 2026-04-16T11:14:51 1776338091

As it has been discussed in a few recent threads on HN, whenever a new model is released, running it successfully may need changes in the inference backends, such as llama.cpp.

There are 2 main reasons. One is the tokenizer, where new tokenizer definitions may be mishandled by the older tokenizer parsers.

The second reason is that each model may implement differently the tool invocations, e.g. by using different delimiter tokens and different text layouts for describing the parameters of a tool invocation.

Therefore running the Gemma-4 models encountered various problems during the first days after their release, especially for the dense 31B model.

Solving these problems required both a new version of llama.cpp (also for other inference backends) and updates in the model chat template and tokenizer configuration files.

So anyone who wants to use Gemma-4 should update to the latest version of llama.cpp and to the latest models from Huggingface, because the latest updates have been a couple of days ago.

roosgit · 2026-04-16T08:40:05 1776328805

I just hit that error a few minutes ago. I build my llama.cpp from source because I use CUDA on Linux. So I made the mistake of trying to run Gemma4 on an older version I had and I got the same error. It’s possible brew installs an older version which doens’t support Gemma4 yet.

teekert · 2026-04-16T08:50:40 1776329440

Ah it was indeed just that!

I'm now on:

$ llama --version version: 8770 (82764d8) built with GNU 15.2.0 for Linux x86_64

(From Nix unstable)

And this works as advertised, nice chat interface, but no openai API I guess, so no opencode...

homarp · 2026-04-16T09:00:46 1776330046

check on same port, there is an OpenAI API https://github.com/ggml-org/llama.cpp/tree/master/tools/serv...

teekert · 2026-04-16T09:33:47 1776332027

Good stuff, thanx!

zozbot234 · 2026-04-16T08:45:42 1776329142

And that's exactly why llama.cpp is not usable by casual users. They follow the "move fast and break things" model. With ollama, you just have to make sure you're getting/building the latest version.

Eisenstein · 2026-04-16T10:20:51 1776334851

Its not possible to run the latest model architectures without 'moving fast'. The only thing broken here is that they are trying to use an old version with a new model.

cyanydeez · 2026-04-16T10:38:36 1776335916

and Ollama suffered the same fate when wanting to try new models

Eisenstein · 2026-04-16T13:31:58 1776346318

What fate?

cyanydeez · 2026-04-16T17:18:20 1776359900

the impedance mismatch between when models are released and the capability of Ollama and other servers capability for use.

Eisenstein · 2026-04-16T18:17:26 1776363446

I'm a bit unsure what that has to do with someone running an outdated version of the program while trying to use a model that is supported in the latest release.