This is a really interesting question. I think there's definitely room for both deployment models. Maybe a good analogy is database engines: both SQLite (a library) and Postgres (a long-running service) see widespread use, with different tradeoffs.
But those are typically filling the use cases of productivity applications, not ‘engines’.
Microsoft Word doesn’t run its grammar checker as an external service and shunt JSON over a localhost socket to get spelling and style suggestions.
Photoshop doesn’t install a background service to host filters.
The closest pattern I can think of is the ‘language server’ model used by IDEs to handle autosuggest - see https://microsoft.github.io/language-server-protocol/ - but the point of that is to enable many-to-many interop: multiple languages supporting multiple IDEs. Is that the expected use case for local language assistants and image generators?
Funny choice of example. You’ve always been able to use Word as a remote spellchecker over COM, and as of Windows 8, spellchecking is available system wide and runs in a separate process (again over COM) for sandboxing reasons.
JSON over TCP is perhaps a silly IPC mechanism for local services, but this kind of composition doesn’t seem unreasonable to me.
That's not how COM works. You can load Word's spellchecker into your process.
Windows added a spellchecking API in Windows 8. I've not dug into the API in detail, but don't see any indication that spellchecker providers run in a separate process (you can probably build one that works that way, but it's not intrinsic to the provider model).
Are you not familiar with out of process COM servers? A lot of Office automation is out of process, even inside of Office itself. Admittedly I’m not sure about the grammar checker specifically.
Anyway, my point still stands - building desktop apps using composition over RPC is neither new nor a bad idea, although HTTP might not be the best RPC mechanism (although… neither was COM…)
The language server pattern is actually a very good comparison. The web service + web UI approach enables you to use different local and/or cloud AI services interchangeably. That is why most of these servers/services support the OpenAI API.
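To make the interchangeability concrete: a client built against the OpenAI-style `/v1/chat/completions` endpoint can be pointed at a local server or a cloud one just by swapping the base URL. A minimal sketch using only the standard library - the port (Ollama's default) and model name here are assumptions, not fixed values:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    """Build an OpenAI-style chat-completion request for any compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# The same client code works against a local or a cloud backend;
# only the base URL changes.
req = build_chat_request("http://localhost:11434", "mistral", "Hello!")
print(req.full_url)  # http://localhost:11434/v1/chat/completions
```

Sending it is then just `urllib.request.urlopen(req)` against whichever backend is running.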
In addition to the initial loading time noted by the other posters:
You may want to use the same inference engine or even the same LLM for multiple purposes in multiple applications.
Also, and this is a huge factor in my opinion, there's getting your machine, environment and OS into a state that can run the models efficiently. That wasn't trivial for me. Putting all this complexity inside a container (and therefore a "server") helps tremendously, a) in setting everything up initially and b) in keeping up with the constant improvements and updates that are happening regularly.
It doesn’t make sense to load the weights on the fly - that is gigabytes of memory that has to be shuffled around. Instead, you have a long-running process that serves up lots of predictions
(edit: someday soon, probably to multiple clients too!)
If you don’t have that memory to spare you can’t run this locally anyways, and keeping it in memory is the only way to have a fast experience. Paying the model loading cost repeatedly sucks.
It would be loaded repeatedly if the UI is opened and closed repeatedly. You can achieve the same “long-running server + short-lived UI window” with multiple threads or processes all linked into one binary if you want, of course. This way (with a separate server) seems simpler to me (and has the added benefit that multiple applications could easily call into the “server” if needed)
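The in-process variant described above can be sketched as a background worker that pays the load cost once and then serves requests over a queue. The `load_model`/`predict` functions here are stand-ins for a real inference engine, not actual APIs:

```python
import queue
import threading

def load_model():
    # Stand-in for the expensive step: mapping gigabytes of weights into memory.
    return {"weights": "..."}

def predict(model, prompt):
    # Stand-in for actual token generation.
    return f"echo: {prompt}"

requests_q: queue.Queue = queue.Queue()

def worker():
    model = load_model()        # paid once, not once per request
    while True:
        prompt, reply_q = requests_q.get()
        if prompt is None:      # shutdown sentinel
            break
        reply_q.put(predict(model, prompt))

threading.Thread(target=worker, daemon=True).start()

# Any number of short-lived "clients" (UI windows opening and closing)
# can reuse the already-loaded model:
reply_q: queue.Queue = queue.Queue()
requests_q.put(("hello", reply_q))
print(reply_q.get())  # echo: hello
```

A separate server process is the same pattern with the queue replaced by a socket, which is what lets unrelated applications share the one loaded model.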
This is a good thing IMO. I don't have a very powerful laptop or workstation, but do have a multi-GPU headless server. These projects allow me to experiment with LLMs on my server, and expose an API and web UI to my LAN.
In addition to everything that everyone else has said: I run Ollama on a large gaming PC for speed but want to be able to use the models from elsewhere in the house. So I run Open-WebUI at chat.domain.example and Ollama at api.chat.domain.example (both only accessible within my local network).
With this setup I can use my full-speed local models from both my laptop and my phone with the web UI, and my raspberry pi that's running my experimental voice assistant can query Ollama through the API endpoints, all at the full speed enabled by my gaming GPU.
The same logic goes for my Stable Diffusion setup.
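A setup like this is typically just a reverse proxy in front of the two services. A hypothetical nginx sketch - the hostnames match the ones above, the CIDR and the Open-WebUI port are assumptions you'd adjust to your own network (Ollama's default port really is 11434):

```nginx
# LAN-only reverse proxy: web UI and API on separate hostnames.
server {
    listen 80;
    server_name chat.domain.example;
    allow 192.168.1.0/24;                    # local network only
    deny  all;
    location / {
        proxy_pass http://127.0.0.1:8080;    # assumed Open-WebUI port
    }
}
server {
    listen 80;
    server_name api.chat.domain.example;
    allow 192.168.1.0/24;
    deny  all;
    location / {
        proxy_pass http://127.0.0.1:11434;   # Ollama's default port
    }
}
```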
Because it adds flexibility. By decoupling the frontend from the backend it's much easier for other devs not directly affiliated with the server repo (e.g. Ollama) to design new frontends that can connect to it.
I also think it allows experts to focus on what they are good at. Some people have a really keen eye for aesthetics and can design amazing front-end experiences, and some people are the exact opposite and prefer to work on the backend.
Additionally, since it runs as a server, I can place it on a powerful headless machine that I have and can access that easily from significantly less powerful devices such as my phone and laptop.
> I don’t like running background services locally if I don’t need to. Why do these implementations all seem to operate that way?
Because it's now a simple REST-like query to interact with that server.
The default model of running the binary and capturing its output would mean reloading everything each time. Of course, you could write a master process that actually performs the queries and a separate executable for querying that master process... wait, you just invented a server.
I’m not sure what this ‘default model of running a binary and capturing its output’ is that you’re talking about.
Aren’t people mostly running browser frontends in front of these to provide a persistent UI - a chat interface or an image workspace or something?
Sure, if you’re running a lot of little command-line tools that need access to an LLM, a server makes sense. What I don’t understand is why that isn’t a niche way of distributing these things - instead it seems to be the default.
If you just check out https://github.com/ggerganov/llama.cpp and run make, you’ll wind up with an executable called ‘main’ that lets you run any gguf language model you choose. Then:
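For instance (the model filename here is a placeholder for whichever gguf you've downloaded):

```shell
# Build llama.cpp and run a one-shot completion; no server involved.
make
./main -m ./models/mistral-7b-v0.1.Q4_K_M.gguf \
       -p "The capital of France is" \
       -n 32   # generate up to 32 tokens, then exit
```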
On my M2 MacBook, the first run takes a few seconds before it produces anything, but after that subsequent runs start outputting tokens immediately.
You can run LLMs right inside a short-lived process.
But the majority of humans don’t want to use a single execution of a command line to access LLM completions. They want to run a program that lets them interact with an LLM. And to do that they will likely start and leave running a long-lived process with UI state - which can also serve as a host for a longer lived LLM context.
Neither use case particularly seems to need a server to function. My curiosity about why people are packaging these things up like that is completely genuine.
Last run of llama.cpp main off my command line:
llama_print_timings: load time = 871.43 ms
llama_print_timings: sample time = 20.39 ms / 259 runs ( 0.08 ms per token, 12702.31 tokens per second)
llama_print_timings: prompt eval time = 397.77 ms / 3 tokens ( 132.59 ms per token, 7.54 tokens per second)
llama_print_timings: eval time = 20079.05 ms / 258 runs ( 77.83 ms per token, 12.85 tokens per second)
llama_print_timings: total time = 20534.77 ms / 261 tokens
Because “running it locally” often really means running it on a server that you own, called by other servers that you own. This lets you keep the interfaces lightweight and, most importantly, avoid paying premiums to model-hosting services.
>What I don’t like is the trend towards the way to do that being to open up network listeners with no authentication on them.
Yeah - but don't do that.
The thing about small models that can run on commodity hardware is that it breaks the business model of OpenAI and co. They hope that they can run a service that charges a fortune but provides functionality that can't be duplicated. This gives them a moat and a huge revenue engine. Quantized models and student models (trained from the big models outputs) show that the moat is likely to be transitory or partial at best. We can run Mistral 7B at about 1/300th of the cost of a call to GPT4. That makes a whole load of applications viable, but it also torpedoes the monopoly pricing model that they are hoping for.
All we need to do now is to stop people training on stolen data.
The main reason I see is to use the same AI engine for multiple things like VSCode plugins, UI apps, etc.
That being said I use LM Studio which runs as a UI and allows you to start a local server for coding and editor plugins.
I can run Deepseek Coder in VSCode locally on an M1 Max and it’s actually useful. It’ll just eat the battery quickly if it’s not plugged in since it really slams the GPU. It’s about the only thing I use that will make the M1 make audible fan noise.
I personally find it very useful, because it allows me to run the inference server on a powerful remote server while running the UI locally on a laptop or tablet.
I'll probably use that, because the Rust bindings to llama.cpp don't work on Windows (well, CPU-only works, so it's not usable).
Python is broken (can't install the deps).
Also, mind that loading these models takes dozens of seconds, and you can only load one at a time on your machine, so if you have multiple programs that want to run these models, it makes sense to delegate the job to another program that the user can control.
You have a beefy computer with lots of vram for testing locally, and then once that’s running you want to use the same thing from other computers or from web servers etc. that can’t run the models themselves.
Wouldn't the opposite be some kind of Electron-type situation, where each locally-running app gets its own discrete instance spun up? Sounds slow and unnecessary.
Heavy compute. Often you might need to offload the model to another PC, and because it's heavy compute running general-purpose models, multiple apps can use the same model at the same time.
Have developers forgotten that it’s actually possible to run code inside your UI process?
We see the same thing with stable diffusion runners as well as LLM hosts.
I don’t like running background services locally if I don’t need to. Why do these implementations all seem to operate that way?