This is a really interesting question. I think there's definitely room for both deployment models. Maybe a good analogy is database engines: both SQLite (a library) and Postgres (a long-running service) see widespread use, with different tradeoffs.
But those are typically filling the use cases of productivity applications, not ‘engines’.
Microsoft Word doesn’t run its grammar checker as an external service and shunt JSON over a localhost socket to get spelling and style suggestions.
Photoshop doesn’t install a background service to host filters.
The closest pattern I can think of is the ‘language server’ model used by IDEs to handle autosuggest - see https://microsoft.github.io/language-server-protocol/ - but the point of that is to enable many-to-many interop: multiple languages supporting multiple IDEs. Is that the expected use case for local language assistants and image generators?
Funny choice of example. You’ve always been able to use Word as a remote spellchecker over COM, and as of Windows 8, spellchecking is available system wide and runs in a separate process (again over COM) for sandboxing reasons.
JSON over TCP is perhaps a silly IPC mechanism for local services, but this kind of composition doesn’t seem unreasonable to me.
That's not how COM works. You can load Word's spellchecker into your process.
Windows added a spellchecking API in Windows 8. I've not dug into the API in detail, but don't see any indication that spellchecker providers run in a separate process (you can probably build one that works that way, but it's not intrinsic to the provider model).
Are you not familiar with out of process COM servers? A lot of Office automation is out of process, even inside of Office itself. Admittedly I’m not sure about the grammar checker specifically.
Anyway, my point still stands - building desktop apps using composition over RPC is neither new nor a bad idea, although HTTP might not be the best RPC mechanism (although… neither was COM…)
The language server pattern is actually a very good comparison. The web service + web UI approach enables you to use different local and/or cloud AI services interchangeably. That is why most of these servers/services support the OpenAI API.
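To make the interchangeability concrete: a client built against the OpenAI-style `/v1/chat/completions` endpoint can be pointed at a local server or a cloud one just by swapping the base URL. A minimal sketch using only the standard library - the port (Ollama's default) and model name here are assumptions, not fixed values:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    """Build an OpenAI-style chat-completion request for any compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# The same client code works against a local or a cloud backend;
# only the base URL changes.
req = build_chat_request("http://localhost:11434", "mistral", "Hello!")
print(req.full_url)  # http://localhost:11434/v1/chat/completions
```

Sending it is then just `urllib.request.urlopen(req)` against whichever backend is running.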
In addition to the initial loading time noted by the other posters:
You may want to use the same inference engine or even the same LLM for multiple purposes in multiple applications.
Also, and this is a huge factor in my opinion, there's getting your machine, environment and OS into a state that can run the models efficiently. That wasn't trivial for me. Putting all this complexity inside a container (and therefore a "server") helps tremendously, a) in setting everything up initially and b) in keeping up with the constant improvements and updates that are happening regularly.
It doesn’t make sense to load the weights on the fly - that is gigabytes of memory that has to be shuffled around. Instead, you have a long-running process that serves up lots of predictions
(edit: someday soon, probably to multiple clients too!)
If you don’t have that memory to spare you can’t run this locally anyways, and keeping it in memory is the only way to have a fast experience. Paying the model loading cost repeatedly sucks.
It would be loaded repeatedly if the UI is opened and closed repeatedly. You can achieve the same “long-running server + short-lived UI window” with multiple threads or processes all linked into one binary if you want, of course. This way (with a separate server) seems simpler to me (and has the added benefit that multiple applications could easily call into the “server” if needed)
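The in-process variant described above can be sketched as a background worker that pays the load cost once and then serves requests over a queue. The `load_model`/`predict` functions here are stand-ins for a real inference engine, not actual APIs:

```python
import queue
import threading

def load_model():
    # Stand-in for the expensive step: mapping gigabytes of weights into memory.
    return {"weights": "..."}

def predict(model, prompt):
    # Stand-in for actual token generation.
    return f"echo: {prompt}"

requests_q: queue.Queue = queue.Queue()

def worker():
    model = load_model()        # paid once, not once per request
    while True:
        prompt, reply_q = requests_q.get()
        if prompt is None:      # shutdown sentinel
            break
        reply_q.put(predict(model, prompt))

threading.Thread(target=worker, daemon=True).start()

# Any number of short-lived "clients" (UI windows opening and closing)
# can reuse the already-loaded model:
reply_q: queue.Queue = queue.Queue()
requests_q.put(("hello", reply_q))
print(reply_q.get())  # echo: hello
```

A separate server process is the same pattern with the queue replaced by a socket, which is what lets unrelated applications share the one loaded model.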
This is a good thing IMO. I don't have a very powerful laptop or workstation, but do have a multi-GPU headless server. These projects allow me to experiment with LLMs on my server, and expose an API and web UI to my LAN.
In addition to everything that everyone else has said: I run Ollama on a large gaming PC for speed but want to be able to use the models from elsewhere in the house. So I run Open-WebUI at chat.domain.example and Ollama at api.chat.domain.example (both only accessible within my local network).
With this setup I can use my full-speed local models from both my laptop and my phone with the web UI, and my raspberry pi that's running my experimental voice assistant can query Ollama through the API endpoints, all at the full speed enabled by my gaming GPU.
The same logic goes for my Stable Diffusion setup.
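A setup like this is typically just a reverse proxy in front of the two services. A hypothetical nginx sketch - the hostnames match the ones above, the CIDR and the Open-WebUI port are assumptions you'd adjust to your own network (Ollama's default port really is 11434):

```nginx
# LAN-only reverse proxy: web UI and API on separate hostnames.
server {
    listen 80;
    server_name chat.domain.example;
    allow 192.168.1.0/24;                    # local network only
    deny  all;
    location / {
        proxy_pass http://127.0.0.1:8080;    # assumed Open-WebUI port
    }
}
server {
    listen 80;
    server_name api.chat.domain.example;
    allow 192.168.1.0/24;
    deny  all;
    location / {
        proxy_pass http://127.0.0.1:11434;   # Ollama's default port
    }
}
```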
Because it adds flexibility. By decoupling the frontend from the backend it's much easier for other devs not directly affiliated with the server repo (e.g. Ollama) to design new frontends that can connect to it.
I also think it allows experts to focus on what they are good at. Some people have a really keen eye for aesthetics and can design amazing front-end experiences, and some people are the exact opposite and prefer to work on the backend.
Additionally, since it runs as a server, I can place it on a powerful headless machine that I have and can access that easily from significantly less powerful devices such as my phone and laptop.
> I don’t like running background services locally if I don’t need to. Why do these implementations all seem to operate that way?
Because it's now a simple REST-like query to interact with that server.
The default model of running the binary and capturing its output would mean reloading everything each time. Of course, you could write a master process that actually performs the queries and a separate executable for querying that master process... wait, you just invented a server.
I’m not sure what this ‘default model of running a binary and capturing its output’ is that you’re talking about.
Aren’t people mostly running browser frontends in front of these to provide a persistent UI - a chat interface or an image workspace or something?
Sure, if you’re running a lot of little command-line tools that need access to an LLM, a server makes sense. What I don’t understand is why that isn’t a niche way of distributing these things - instead it seems to be the default.
If you just check out https://github.com/ggerganov/llama.cpp and run make, you’ll wind up with an executable called ‘main’ that lets you run any gguf language model you choose. Then:
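For instance (the model filename here is a placeholder for whichever gguf you've downloaded):

```shell
# Build llama.cpp and run a one-shot completion; no server involved.
make
./main -m ./models/mistral-7b-v0.1.Q4_K_M.gguf \
       -p "The capital of France is" \
       -n 32   # generate up to 32 tokens, then exit
```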
On my M2 MacBook, the first run takes a few seconds before it produces anything, but after that subsequent runs start outputting tokens immediately.
You can run LLMs right inside a short-lived process.
But the majority of humans don’t want to use a single execution of a command line to access LLM completions. They want to run a program that lets them interact with an LLM. And to do that they will likely start and leave running a long-lived process with UI state - which can also serve as a host for a longer lived LLM context.
Neither use case particularly seems to need a server to function. My curiosity about why people are packaging these things up like that is completely genuine.
Last run of llama.cpp main off my command line:
llama_print_timings: load time = 871.43 ms
llama_print_timings: sample time = 20.39 ms / 259 runs ( 0.08 ms per token, 12702.31 tokens per second)
llama_print_timings: prompt eval time = 397.77 ms / 3 tokens ( 132.59 ms per token, 7.54 tokens per second)
llama_print_timings: eval time = 20079.05 ms / 258 runs ( 77.83 ms per token, 12.85 tokens per second)
llama_print_timings: total time = 20534.77 ms / 261 tokens
Because “running it locally” often really means running it on a server that you own, called by other servers that you own. This lets you keep the interfaces lightweight and, most importantly, avoid paying premiums to model-hosting services.
>What I don’t like is the trend towards the way to do that being to open up network listeners with no authentication on them.
Yeah - but don't do that.
The thing about small models that can run on commodity hardware is that it breaks the business model of OpenAI and co. They hope that they can run a service that charges a fortune but provides functionality that can't be duplicated. This gives them a moat and a huge revenue engine. Quantized models and student models (trained from the big models outputs) show that the moat is likely to be transitory or partial at best. We can run Mistral 7B at about 1/300th of the cost of a call to GPT4. That makes a whole load of applications viable, but it also torpedoes the monopoly pricing model that they are hoping for.
All we need to do now is to stop people training on stolen data.
The main reason I see is to use the same AI engine for multiple things like VSCode plugins, UI apps, etc.
That being said I use LM Studio which runs as a UI and allows you to start a local server for coding and editor plugins.
I can run Deepseek Coder in VSCode locally on an M1 Max and it’s actually useful. It’ll just eat the battery quickly if it’s not plugged in since it really slams the GPU. It’s about the only thing I use that will make the M1 make audible fan noise.
I personally find it very useful, because it allows me to run the inference server on a powerful remote server while running the UI locally on a laptop or tablet.
I'll probably use that, because the Rust bindings to llama.cpp don't work on Windows (well, CPU-only works, so it's not usable).
Python is broken (can't install the deps).
Also, mind that loading these models takes dozens of seconds, and you can only load one at a time on your machine, so if you have multiple programs that want to run these models, it makes sense to delegate the job to another program that the user can control.
You have a beefy computer with lots of vram for testing locally, and then once that’s running you want to use the same thing from other computers or from web servers etc. that can’t run the models themselves.
Wouldn't the opposite be some kind of Electron-type situation, where each locally-running app gets its own discrete instance spun up? Sounds slow and unnecessary.
Heavy compute. Often you might need to offload the model to another PC, and because it's heavy compute running general-purpose models, multiple apps can use the same model at the same time.
Have developers forgotten that it’s actually possible to run code inside your UI process?
We see the same thing with stable diffusion runners as well as LLM hosts.
I don’t like running background services locally if I don’t need to. Why do these implementations all seem to operate that way?