In my case it's just CPU (it's a Hetzner server; I checked /proc/cpuinfo and it says "AMD EPYC 9454P 48-Core Processor"). I still had some stats in my terminal backlog, so I pasted them below.
It's not a speed demon, but it's enough to mess around with and test things out. Thinking can sometimes run pretty long, so it can take a while to get responses, even if ~6 tokens/sec is pretty good considering it's a pure CPU setup.
---
prompt eval time = 133.55 ms / 1 tokens ( 133.55 ms per token, 7.49 tokens per second)
eval time = 392205.46 ms / 2220 tokens ( 176.67 ms per token, 5.66 tokens per second)
total time = 392339.02 ms / 2221 tokens
(IIRC the slot save path argument does absolutely nothing unless you actually use the slot-saving feature, so it's superfluous here, but I've been pasting around a similar command and have been too lazy to remove it.) -ctk q8_0 quantizes the K cache to q8_0, which reduces memory use a bit for the context.
I think my 256 GB is right at the limit of spilling a bit into swap, so I'm pushing the limits :)
The --min-p 0.1 was a recommendation from the Unsloth page; I think the idea is that because the quant is going so low in bits, some things may start to misbehave, and this is a mitigation. But I haven't messed around enough to say how true that is, or to add any nuance. I think I put --temp 0.6 in for the same reason.
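Roughly, the invocation looks something like the sketch below; the model path, context size, thread count, and port here are placeholders rather than my exact values, but the sampling and cache flags are the ones discussed above.

    # Sketch of a llama-server launch; paths and sizes are placeholders.
    llama-server \
      --model /path/to/your-model.gguf \
      --ctx-size 16384 \
      --threads 48 \
      -ctk q8_0 \
      --min-p 0.1 \
      --temp 0.6 \
      --host 127.0.0.1 \
      --port 8080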
To explain for anyone not aware of llama-server: it exposes a (somewhat) OpenAI-compatible API, so you can use it with any software that speaks that protocol. llama-server also ships its own web UI, but I haven't used it.
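As a rough illustration (assuming the default port 8080 and that the server is reachable on localhost, e.g. via a tunnel), a plain curl against the chat completions endpoint is enough to talk to it:

    # Minimal request to llama-server's OpenAI-compatible endpoint.
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello"}], "temperature": 0.6}'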
I had some SSH tunnels set up to use the server with https://github.com/oobabooga/text-generation-webui, into which I hacked an "OpenAI" client (that UI doesn't have one natively). The only reason I use the oobabooga UI is habit, so I don't recommend this setup to others.
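The tunnel itself is just plain SSH local port forwarding, something along these lines (the hostname and ports are made up for the example):

    # Forward local port 8080 to llama-server's port on the remote box.
    ssh -N -L 8080:localhost:8080 user@my-hetzner-box
    # Then point the UI's OpenAI-compatible endpoint at http://localhost:8080/v1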