I just released Shimmy v1.7.0 with MoE (Mixture of Experts) CPU offloading support, and the results are pretty exciting for anyone who's hit GPU memory walls.
What this solves
If you've tried running large language models locally, you know the pain: a 42B parameter model typically needs 80GB+ of VRAM, putting it out of reach for most developers. Even "smaller" 20B models often require 40GB+.
The breakthrough
MoE CPU offloading moves the expert layers (the bulk of a MoE model's weights) to CPU memory while keeping the active computation on the GPU. In practice:
Phi-3.5-MoE 42B: Runs on 8GB consumer GPUs (was impossible before)
GPT-OSS 20B: 71.5% VRAM reduction (15GB → 4.3GB, measured)
DeepSeek-MoE 16B: Down to 800MB VRAM with Q2 quantization
The tradeoff is 2-7x slower inference, but the alternative is not being able to run these models at all.
Technical implementation
Built on enhanced llama.cpp bindings with new with_cpu_moe() and with_n_cpu_moe(n) methods (rough usage sketch after this list)
Two CLI flags: --cpu-moe (automatic) and --n-cpu-moe N (manual control)
Cross-platform: Windows MSVC CUDA, macOS Metal, Linux x86_64/ARM64
Still sub-5MB binary with zero Python dependencies
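As a rough sketch of the library surface: the two method names below come from this release, but the type they are called on and the layer count are my assumptions, not confirmed API.

// Sketch only: with_cpu_moe() / with_n_cpu_moe(n) are the new methods; the
// receiver shown here and the layer count are assumed for illustration.
let engine = LlamaEngine::new()
    .with_cpu_moe();        // automatic: push all expert tensors to CPU (like --cpu-moe)
let engine = LlamaEngine::new()
    .with_n_cpu_moe(16);    // manual: offload experts for 16 layers (like --n-cpu-moe N)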
Ready-to-use models
I've uploaded 9 quantized models to HuggingFace specifically optimized for this:
Phi-3.5-MoE variants (Q8_0, Q4_K_M, Q2_K)
DeepSeek-MoE variants
GPT-OSS 20B baseline
Getting started
# Install
cargo install shimmy
# Download a model
huggingface-cli download MikeKuykendall/phi-3.5-moe-q4-k-m-cpu-offload-gguf
# Run with MoE offloading
./shimmy serve --cpu-moe --model-path phi-3.5-moe-q4-k-m.gguf
Standard OpenAI-compatible API, so existing code works unchanged.
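For example, here's a minimal Rust client against the OpenAI-compatible chat endpoint (assumes the reqwest crate with its "blocking" and "json" features, plus serde_json). The address and model name are placeholders; use whatever shimmy serve reports on startup.

// Hedged example: the port and model name below are placeholders.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "phi-3.5-moe-q4-k-m",
        "messages": [{ "role": "user", "content": "Say hello in five words." }]
    });

    let text = reqwest::blocking::Client::new()
        .post("http://127.0.0.1:11435/v1/chat/completions") // placeholder address
        .json(&body)
        .send()?
        .text()?;

    println!("{text}");
    Ok(())
}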
Why this matters
This democratizes access to state-of-the-art models. Instead of needing a $10,000 GPU or an ongoing cloud bill, you can run mixture-of-experts models on gaming laptops or modest server hardware.
It's not just about making models "work" - it's about sustainable AI deployment where organizations can experiment with cutting-edge architectures without massive infrastructure investments.
The technique itself isn't novel (llama.cpp had MoE support), but the Rust bindings, production packaging, and curated model collection make it accessible to developers who just want to run large models locally.
Release: https://github.com/Michael-A-Kuykendall/shimmy/releases/tag/...
Models: https://huggingface.co/MikeKuykendall
Happy to answer questions about the implementation or performance characteristics.
Great alternatives list! Each serves different use cases:
QtMultimedia: Excellent for C++/Qt developers, but requires the Qt framework
react-native-vision-camera: Perfect for mobile, but CrabCamera targets desktop
OpenCV: Great for computer vision, but heavy for simple camera access
CrabCamera's niche: Rust developers building Tauri desktop apps who want:
Zero Qt dependencies
Native Rust integration
Minimal bundle size
Cross-platform camera control
Different tools for different ecosystems! Currently powering our Budsy plant identification app.
Good point about explaining Tauri better! For context:
Tauri = Rust + web frontend (like Electron but smaller/faster)
Problem: Desktop apps need camera access, but web APIs are limited
CrabCamera: Provides native camera control for Tauri desktop apps
Real example: Our Budsy plant identification app uses CrabCamera to capture photos for botanical analysis - something web camera APIs can't do effectively.
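For readers new to Tauri, this is roughly where such a plugin sits. The scaffold below is standard Tauri command wiring, but capture_photo is a made-up placeholder, not CrabCamera's actual API.

// src-tauri/src/main.rs - stock Tauri command pattern; capture_photo is hypothetical
#[tauri::command]
fn capture_photo() -> Result<Vec<u8>, String> {
    // A plugin like CrabCamera would drive the OS camera stack here and
    // hand encoded image bytes back to the web frontend via invoke().
    Ok(Vec::new())
}

fn main() {
    tauri::Builder::default()
        .invoke_handler(tauri::generate_handler![capture_photo])
        .run(tauri::generate_context!())
        .expect("error while running tauri app");
}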
Thanks for the feedback on clarity!
You were absolutely right about WebRTC complexity! Since that feedback, we've refocused CrabCamera on its core mission - desktop camera access for Tauri apps.
Changes made based on your feedback:
Removed WebRTC from core scope
Focused on clean camera capture API
Left streaming protocols to dedicated libraries
Current CrabCamera v0.3.0:
45/45 tests passing
Production-ready in Budsy plant identification app
Clean separation of concerns
Thanks for steering us toward better architecture!
GPU/CUDA: Yes, but disabled by default for faster builds. To enable: remove LLAMA_CUDA = "OFF" from config.toml and rebuild with CUDA toolkit installed.
Rust library: Absolutely! Add shimmy = { version = "0.1.0", features = ["llama"] } to Cargo.toml. Use the inference engine directly:
// Create the llama.cpp-backed engine
let engine = shimmy::engine::llama::LlamaEngine::new();
// Load the model described by your model spec (`spec`)
let model = engine.load(&spec).await?;
// Generate a completion; `opts` holds your generation options
let response = model.generate("prompt", opts, None).await?;
No need to spawn processes - just import and use the components directly in your Rust code.
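If it helps, here is the minimal async scaffold those lines need. Using tokio as the runtime is my assumption (the thread doesn't say which runtime shimmy expects), and the model-spec construction is left as a comment because its fields aren't shown here.

// Assumed scaffold: tokio as the async runtime is a guess, not confirmed.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let _engine = shimmy::engine::llama::LlamaEngine::new();
    // Build a model spec for your local GGUF file per shimmy's docs (fields
    // not shown in this thread), then call engine.load(&spec).await? and
    // model.generate(...).await? exactly as in the snippet above.
    Ok(())
}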
Shimmy is designed to be "invisible infrastructure" - the simplest possible way to get local inference working with your existing AI tools. llama-server gives you more control, llama-swap gives you multi-model management.
Key differences:
- Architecture: llama-swap = proxy + multiple servers, Shimmy = single server
- Resource usage: llama-swap runs multiple processes, Shimmy = one 50MB process
- Use case: llama-swap for managing many models, Shimmy for simplicity
Shimmy is for when you want the absolute minimum footprint - CI/CD pipelines, quick local testing, or systems where you can't install 680MB of dependencies.