Seemed like a great project. Hope to see it come back!
There are some great open-source projects in this space – not quite the same, but many are focused on local LLMs like Llama 2 or Code Llama, which was released last week:
- https://github.com/jmorganca/ollama (download & run LLMs locally - I'm a maintainer)
- https://github.com/simonw/llm (access LLMs from the cli - cloud and local)
- https://github.com/oobabooga/text-generation-webui (a web ui w/ different backends)
- https://github.com/ggerganov/llama.cpp (fast local LLM runner)
- https://github.com/go-skynet/LocalAI (has an openai-compatible api)
The UI is relatively mature, as it predates LLaMA. It includes upstream llama.cpp PRs, integrated AI Horde support, lots of sampling tuning knobs, easy GPU/CPU offloading, and it's basically dependency-free.
Yes. At first glance it looks like a Windows app, but it's actually very portable. It has parameters for GPU offloading and extended context size that just work, and it exposes an API endpoint. I use it on a workstation to serve larger LLMs locally and like the performance and ease of use.
Ollama is very neat. Given how compressible the models are, is there any work being done on using them in some kind of compressed format, other than reducing the word size?
There are different levels of quantization available for different models (if that's what you mean :). E.g. here are the versions available for Llama 2: https://ollama.ai/library/llama2/tags – these go down to 2-bit quantization (which surprisingly still works reasonably well).
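If it helps make the "2-bit" part concrete, here's a rough sketch of what block-wise quantization boils down to: small signed integers plus a per-block scale. This is just the general idea, not the actual k-quant formats that llama.cpp/Ollama use, and all names here are made up for illustration.

    # Rough sketch of block-wise k-bit quantization (not the real llama.cpp
    # formats): store each block of float32 weights as small signed integers
    # plus a single fp32 scale.
    import numpy as np

    def quantize_block(weights, bits):
        qmax = 2 ** (bits - 1) - 1                  # 1 for 2-bit, 7 for 4-bit
        scale = float(np.max(np.abs(weights))) / qmax or 1.0
        q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def dequantize_block(q, scale):
        return q.astype(np.float32) * scale

    block = np.random.randn(64).astype(np.float32)
    q, scale = quantize_block(block, bits=2)
    error = np.max(np.abs(block - dequantize_block(q, scale)))

    # 64 weights: 64 * 32 bits as fp32 vs. 64 * 2 bits + one fp32 scale here,
    # i.e. 2.5 bits/weight, roughly 13x smaller, at the cost of `error`.
    print(f"max abs error: {error:.3f}")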
No, what I mean is that there seems to be quite a bit of sparsity in the weight matrices, and I was wondering if that can somehow be used to further shrink the model. Quantization is a different effect: it leaves the shape of the various elements as they are but reduces their bit depth.
Ah, gotcha! I thought you probably meant something else. I've been wondering this too, and it's something I've been meaning to look at.
On a related note, it doesn't seem like many local runners are leveraging techniques like PagedAttention yet (see https://vllm.ai/), which takes inspiration from operating system memory paging to reduce memory requirements for LLMs.
It's not quite what you mentioned, but it might have a similar effect! Would love to know if you've seen other methods that might help reduce memory requirements – it's one of the largest resource bottlenecks to running LLMs right now!
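Here's a toy sketch of the core idea as I understand it from the vLLM paper: the KV cache is carved into fixed-size blocks handed out on demand, and a per-sequence "block table" maps token positions to physical blocks, much like virtual-memory pages in an OS. All names below are invented for illustration; this is not vLLM's actual API.

    BLOCK_SIZE = 16                                     # tokens per KV block

    class PagedKVCache:
        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))  # physical block pool
            self.block_tables = {}                      # seq_id -> [block ids]
            self.lengths = {}                           # seq_id -> tokens stored

        def append_token(self, seq_id):
            """Reserve KV storage for one new token of a sequence."""
            table = self.block_tables.setdefault(seq_id, [])
            length = self.lengths.get(seq_id, 0)
            if length % BLOCK_SIZE == 0:                # current block is full
                if not self.free_blocks:
                    raise MemoryError("KV cache exhausted; evict or preempt")
                table.append(self.free_blocks.pop())    # map a new physical block
            self.lengths[seq_id] = length + 1

        def free_sequence(self, seq_id):
            """Return a finished sequence's blocks to the pool."""
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))
            self.lengths.pop(seq_id, None)

    cache = PagedKVCache(num_blocks=8)
    for _ in range(40):                                 # 40 tokens -> 3 blocks used
        cache.append_token(seq_id=0)
    print(cache.block_tables[0])                        # e.g. [7, 6, 5]
    cache.free_sequence(0)                              # blocks reusable right away

The win is that memory is committed per block actually used rather than per worst-case context window, so short sequences don't pin down memory they'll never touch.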
That's a clever one, I had not seen that yet, thank you.
The hint for me is that the models compress so well; that suggests the information content is much lower than the size of the uncompressed model indicates, which is a good reason to investigate which parts of the model are so compressible and why. I haven't looked at the raw data of these models, but maybe I'll give it a shot. Sometimes you can learn a lot about the structure (built-in or emergent) of data just by staring at the dumps.
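Something like this would probably be my first pass – compress one raw tensor dump and count near-zero weights. The file name and the raw float32 layout are assumptions for illustration; real checkpoint formats need a proper reader.

    import zlib
    import numpy as np

    raw = open("layer0.attn_q.weight.bin", "rb").read()   # hypothetical dump
    weights = np.frombuffer(raw, dtype=np.float32)

    # How much does a general-purpose compressor squeeze out?
    compressed = zlib.compress(raw, 9)
    print(f"raw: {len(raw)} bytes, zlib: {len(compressed)} bytes "
          f"({len(compressed) / len(raw):.1%} of original)")

    # How sparse are the weights, for a few notions of "near zero"?
    for threshold in (1e-4, 1e-3, 1e-2):
        frac = np.mean(np.abs(weights) < threshold)
        print(f"|w| < {threshold:g}: {frac:.1%} of weights")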
That's quite interesting. I hadn't thought of sparsity in the weights as a way to compress models, although it's an obvious opportunity in retrospect! I started doing some digging and found https://github.com/SqueezeAILab/SqueezeLLM, though I'm sure there's newer work on this idea.