Hacker News | MKuykendall's comments

I just released Shimmy v1.7.0 with MoE (Mixture of Experts) CPU offloading support, and the results are pretty exciting for anyone who's hit GPU memory walls.

What this solves

If you've tried running large language models locally, you know the pain: a 42B-parameter model typically needs 80GB+ of VRAM, putting it out of reach for most developers. Even "smaller" 20B models often require 40GB+.

The breakthrough

MoE CPU offloading intelligently moves expert layers to the CPU while keeping active computation on the GPU. In practice:

  - Phi-3.5-MoE 42B: runs on 8GB consumer GPUs (was impossible before)
  - GPT-OSS 20B: 71.5% VRAM reduction (15GB → 4.3GB, measured)
  - DeepSeek-MoE 16B: down to 800MB VRAM with Q2 quantization

The tradeoff is 2-7x slower inference, but you can actually run these models instead of not running them at all.

Technical implementation

  - Built on enhanced llama.cpp bindings with new with_cpu_moe() and with_n_cpu_moe(n) methods
  - Two CLI flags: --cpu-moe (automatic) and --n-cpu-moe N (manual control)
  - Cross-platform: Windows MSVC CUDA, macOS Metal, Linux x86_64/ARM64
  - Still a sub-5MB binary with zero Python dependencies

Ready-to-use models

I've uploaded 9 quantized models to HuggingFace specifically optimized for this:

  - Phi-3.5-MoE variants (Q8_0, Q4_K_M, Q2_K)
  - DeepSeek-MoE variants
  - GPT-OSS 20B baseline

Getting started

  # Install
  cargo install shimmy

  # Download a model
  huggingface-cli download MikeKuykendall/phi-3.5-moe-q4-k-m-cpu-offload-gguf

  # Run with MoE offloading
  ./shimmy serve --cpu-moe --model-path phi-3.5-moe-q4-k-m.gguf

Standard OpenAI-compatible API, so existing code works unchanged.

Why this matters

This democratizes access to state-of-the-art models. Instead of needing a $10,000 GPU or cloud spend, you can run expert models on gaming laptops or modest server hardware. It's not just about making models "work" - it's about sustainable AI deployment, where organizations can experiment with cutting-edge architectures without massive infrastructure investments.

The technique itself isn't novel (llama.cpp already had MoE support), but the Rust bindings, production packaging, and curated model collection make it accessible to developers who just want to run large models locally.

Release: https://github.com/Michael-A-Kuykendall/shimmy/releases/tag/...
Models: https://huggingface.co/MikeKuykendall

Happy to answer questions about the implementation or performance characteristics.
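To make "existing code works unchanged" concrete, here's a minimal Rust sketch against the chat completions endpoint, using reqwest (json feature), tokio, and serde_json. The bind address and model name below are assumptions for illustration; check the startup output of shimmy serve for the actual address, and use whichever model you loaded.

  // Minimal sketch, not shimmy's own client API: it just POSTs to the
  // OpenAI-compatible /v1/chat/completions route the server exposes.
  // Base URL and model name are assumed; adjust to your local setup.
  use serde_json::json;

  #[tokio::main]
  async fn main() -> Result<(), Box<dyn std::error::Error>> {
      let client = reqwest::Client::new();
      let body = json!({
          "model": "phi-3.5-moe-q4-k-m",  // illustrative model name
          "messages": [{ "role": "user", "content": "Summarize MoE offloading in one sentence." }]
      });
      let resp: serde_json::Value = client
          .post("http://127.0.0.1:11435/v1/chat/completions")  // assumed port
          .json(&body)
          .send()
          .await?
          .json()
          .await?;
      println!("{}", resp["choices"][0]["message"]["content"]);
      Ok(())
  }

Any OpenAI-style client library should work the same way once it's pointed at the local base URL.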


Side note: I built a 2,500-star (and climbing) Rust application from stem to stern using it:

https://github.com/Michael-A-Kuykendall/shimmy


Great alternatives list! Each serves different use cases:

  - QtMultimedia: excellent for C++/Qt developers, but requires the Qt framework
  - react-native-vision-camera: perfect for mobile, but CrabCamera targets desktop
  - OpenCV: great for computer vision, but heavy for simple camera access

CrabCamera's niche: Rust developers building Tauri desktop apps who want:

  - Zero Qt dependencies
  - Native Rust integration
  - Minimal bundle size
  - Cross-platform camera control

Different tools for different ecosystems! Currently powering our Budsy plant identification app.


Good point about explaining Tauri better! For context:

  - Tauri = Rust + web frontend (like Electron but smaller/faster)
  - Problem: desktop apps need camera access, but web APIs are limited
  - CrabCamera: provides native camera control for Tauri desktop apps

Real example: our Budsy plant identification app uses CrabCamera to capture photos for botanical analysis - something web camera APIs can't do effectively.

Thanks for the feedback on clarity!
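For anyone who wants to see the Rust side, here's a rough sketch of how a camera plugin gets wired into a Tauri app. The crabcamera::init() call is an assumption based on the standard Tauri plugin convention, not CrabCamera's confirmed API; check the crate's README for the real entry point.

  // Hypothetical wiring only: `crabcamera::init()` is an assumed entry point
  // following the usual Tauri plugin convention; use the actual one from the
  // CrabCamera docs. The Builder/plugin/run pattern itself is standard Tauri.
  fn main() {
      tauri::Builder::default()
          .plugin(crabcamera::init())
          .run(tauri::generate_context!())
          .expect("error while running tauri application");
  }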


You were absolutely right about WebRTC complexity! Since that feedback, we've refocused CrabCamera on its core mission: desktop camera access for Tauri apps.

Changes made based on your feedback:

  - Removed WebRTC from core scope
  - Focused on a clean camera capture API
  - Left streaming protocols to dedicated libraries

Current CrabCamera v0.3.0:

  - 45/45 tests passing
  - Production-ready in the Budsy plant identification app
  - Clean separation of concerns

Thanks for steering us toward better architecture!


This is a crate I'm using for another application; I thought it was neato.


Thx good call!


GPU/CUDA: Yes, but it's disabled by default for faster builds. To enable it, remove LLAMA_CUDA = "OFF" from config.toml and rebuild with the CUDA toolkit installed.

Rust library: Absolutely! Add shimmy = { version = "0.1.0", features = ["llama"] } to Cargo.toml. Use the inference engine directly:

  let engine = shimmy::engine::llama::LlamaEngine::new();
  let model = engine.load(&spec).await?;
  let response = model.generate("prompt", opts, None).await?;

No need to spawn processes - just import and use the components directly in your Rust code.


Try cargo install, or intentionally add an exclusion; unsigned Rust binaries will do this.


Shimmy is designed to be "invisible infrastructure" - the simplest possible way to get local inference working with your existing AI tools. llama-server gives you more control, llama-swap gives you multi-model management.

  Key differences:
  - Architecture: llama-swap = proxy + multiple servers, Shimmy = single server
  - Resource usage: llama-swap runs multiple processes, Shimmy = one 50MB process
  - Use case: llama-swap for managing many models, Shimmy for simplicity


Shimmy is for when you want the absolute minimum footprint - CI/CD pipelines, quick local testing, or systems where you can't install 680MB of dependencies.

