roosgit's comments

I can answer question 3. Prompt processing (how fast your input is parsed) is highly correlated with compute speed. Inference (how fast the LLM generates its answer) is highly correlated with memory bandwidth. So a good CPU might read your question faster, but it will answer pretty much as slowly as a cheap CPU with the same RAM.
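A rough way to see why generation is tied to memory bandwidth: if every weight has to be read once per generated token, then tokens per second cannot exceed bandwidth divided by model size. The sketch below is a back-of-the-envelope estimate, not a benchmark. The dual-channel DDR4-3200 configuration and the 14 GB model size are assumed inputs chosen for illustration, and both function names are hypothetical.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (a DDR channel is 64 bits = 8 bytes wide)."""
    return channels * bus_bytes * mega_transfers_per_s / 1000.0


def generation_ceiling_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed if all weights are read once per token."""
    return bandwidth_gb_s / model_size_gb


# Assumed inputs: dual-channel DDR4-3200 and a quantized model file of about 14 GB.
bandwidth = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=3200)
ceiling = generation_ceiling_tokens_per_s(bandwidth, model_size_gb=14.0)

print(f"peak bandwidth: {bandwidth:.1f} GB/s")
print(f"generation ceiling: {ceiling:.2f} t/s")
```

Real throughput lands below this ceiling because sustained bandwidth is lower than the theoretical peak. A faster CPU does not raise the ceiling; faster or wider memory does.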

I have a Ryzen 3 4100. Just tested Qwen2.5-Coder-32B-Instruct-Q3_K_S.gguf with llama.cpp.

CPU-only:

54.08 t/s prompt eval

2.69 t/s inference

---

CPU + 52/65 layers offloaded to GPU (RTX 3060 12GB):

166.79 t/s prompt eval

6.62 t/s inference
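To put the two runs side by side, here is a small calculation of the speedup per phase, using only the four numbers quoted above. The dictionary keys are just labels picked for this sketch.

```python
# Measured throughput in tokens per second, copied from the two runs above.
cpu_only = {"prompt_eval": 54.08, "inference": 2.69}
gpu_offload = {"prompt_eval": 166.79, "inference": 6.62}  # 52 of 65 layers on the GPU

# Speedup of the partially offloaded run over the CPU-only run, per phase.
speedup = {phase: gpu_offload[phase] / cpu_only[phase] for phase in cpu_only}
offloaded_fraction = 52 / 65

for phase, factor in speedup.items():
    print(f"{phase}: {factor:.2f}x")
print(f"layers on GPU: {offloaded_fraction:.0%}")
```

Prompt processing gains roughly 3x and generation roughly 2.5x. One plausible reading, not something measured here, is that the layers still running on the CPU keep generation limited by system RAM bandwidth.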


Renting could be a good choice to get started. I used to rent a g4dn.xlarge instance from AWS (for Stable Diffusion, not LLMs). More affordable options are Runpod and Vast.ai.

I started with a local system using llama.cpp on CPU alone, and for short questions and answers it was OK for me. Because (in 2023) I didn't know if LLMs would be any good, I chose cheap components (https://news.ycombinator.com/item?id=40267208).

Since AWS was getting pretty expensive, I also bought an RTX 3060 (12GB), an extra 16GB RAM (for a total of 32GB) and a superfast 1TB M.2 SSD. The total cost of the components was around €620.

Here are some basic LLM performance numbers for my system:

https://news.ycombinator.com/item?id=41845936

https://news.ycombinator.com/item?id=42843313


You can find even more affordable + reliable cloud GPU options on Shadeform (YC S23).

It's a GPU marketplace that lets you compare and deploy on-demand instances from big names like Lambda, Scaleway, Crusoe, etc. with a single account.

Super useful for finding the best pricing per GPU type and deploying.

There are H100s for under $2 an hour and H200s for under $3 an hour. Lots of lighter GPU options too (e.g., an A5000 for $0.25/hr).


Start with r/LocalLLaMA and r/StableDiffusion. Look for benchmarks for various GPUs.

I have an RTX 3060 (12GB) and 32GB RAM. Just ran Qwen2.5-14B-Instruct-Q4_K_M.gguf in llama.cpp with flash attention enabled and 8K context. I get 845 t/s for prompt processing and 25 t/s for generation.

For a while I even ran llama.cpp without a GPU (don't recommend it for diffusion) and with the same model (Qwen2.5 14B) I would get 11t/s for processing and 4t/s for generation. Acceptable for chats with short questions/instructions and answers.
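To translate those throughput figures into waiting time, here is a minimal sketch that adds up processing and generation time for one exchange. The 200-token prompt and 300-token answer are hypothetical sizes for a short chat turn. The throughput values are the ones quoted above.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     processing_tps: float, generation_tps: float) -> float:
    """Time to read the prompt plus time to generate the answer, in seconds."""
    return prompt_tokens / processing_tps + output_tokens / generation_tps


# Hypothetical size of a short chat exchange.
PROMPT_TOKENS = 200
OUTPUT_TOKENS = 300

# Throughput for Qwen2.5 14B as reported above (tokens per second).
gpu_seconds = response_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, processing_tps=845, generation_tps=25)
cpu_seconds = response_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, processing_tps=11, generation_tps=4)

print(f"GPU: {gpu_seconds:.1f} s")
print(f"CPU only: {cpu_seconds:.1f} s")
```

With these assumed sizes the GPU answers in about 12 seconds and the CPU in about a minute and a half. On the CPU, a long prompt quickly becomes the dominant cost because processing runs at only 11 t/s.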


How rich?

You can get some inspiration from businesses for sale on Empire Flippers https://empireflippers.com/marketplace/.

As a rule of thumb for choosing the niche, pick from one of these https://support.google.com/admob/answer/3150953?hl=en


I have a separate PC that I access through SSH. I recently bought a GPU for it; before that I was running LLMs on the CPU alone.

- B550MH motherboard

- Ryzen 3 4100 CPU

- 32GB (2x16) RAM cranked up to 3200MHz (token generation is memory bound)

- 256GB M.2 NVMe (helps with loading models faster)

- Nvidia 3060 12GB

Software-wise, I use llamafile because on the CPU it's faster by 10-20% for prompt processing than llama.cpp.

Performance with "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf":

CPU-only: 23.47 t/s (processing), 8.73 t/s (generation)

GPU: 941.5 t/s (processing), 29.4 t/s (generation)


I've never used it, but I think Google Colab has a free plan.

As another option, you can rent a machine with a decent GPU on vast.ai. An Nvidia 3090 can be rented for about $0.20/hr.
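A quick way to compare renting with buying is to ask how many rental hours a given hardware budget would pay for. The sketch below uses the $0.20/hr figure above. The 620 purchase price is a hypothetical budget, treated as if it were in the same currency, and the calculation ignores electricity, storage fees and resale value.

```python
def break_even_hours(hardware_cost: float, hourly_rate: float) -> float:
    """Number of rental hours that cost as much as buying the hardware."""
    return hardware_cost / hourly_rate


# Hypothetical budget of 620 versus the quoted rate of 0.20 per hour.
hours = break_even_hours(hardware_cost=620, hourly_rate=0.20)

print(f"{hours:.0f} rental hours")
print(f"{hours / 24:.0f} days of continuous use")
```

If you only need a GPU for a few hours a week, the break-even point is years away, which is why renting is a reasonable way to start.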


I think Louie Mantia was an icon designer at Apple back then https://lmnt.me/. Maybe Sebastiaan de With as well https://sdw.space/.


I use it to help me write text.

I don't use any tools. I run it from the command line:

./main -f ~/Desktop/prompts/multishot/llama3-few-shot-prompt-10.txt -m ~/Desktop/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf --temp 0 --color -c 1024 -n -1 --repeat_penalty 1.2 -tb 8 --log-disable 2>/dev/null

I prefer `main` to the new `llama-cli` because when searching history for "llama" I want to get commands that contain the "llama" models, not "mistral" ones, for example.
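If you want to call the same command from a script, here is a minimal sketch that assembles the argument list in Python. The flags and paths are copied from the command above. The function name `build_llama_command` is made up for this example, and the actual run is left commented out because it needs a llama.cpp build in the current directory.

```python
import shlex
from pathlib import Path


def build_llama_command(prompt_file: Path, model_file: Path, binary: str = "./main") -> list[str]:
    """Return the argument list for the llama.cpp invocation shown above."""
    return [
        binary,
        "-f", str(prompt_file),      # file containing the prompt
        "-m", str(model_file),       # GGUF model file
        "--temp", "0",
        "--color",
        "-c", "1024",                # context size
        "-n", "-1",
        "--repeat_penalty", "1.2",
        "-tb", "8",
        "--log-disable",
    ]


command = build_llama_command(
    Path.home() / "Desktop/prompts/multishot/llama3-few-shot-prompt-10.txt",
    Path.home() / "Desktop/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf",
)
print(shlex.join(command))

# To run it and discard stderr, as the shell version does with 2>/dev/null:
# import subprocess
# subprocess.run(command, stderr=subprocess.DEVNULL, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.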


I had a similar thing happen to one of my websites. In Varnish I used something like this:

    if (req.http.host ~ "(?i)^(www\.)?example\.com(:[0-9]+)?$") {
        # redirect to https
    } else {
        return (synth(403, "Not allowed."));
    }

It basically checks if the host is my domain. I don’t know what the equivalent of `req.http.host` is on the web server you use. This "solution" might run into issues with Google Translate, but I’m not sure.
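For a server that is not behind Varnish, the same idea can be sketched as a small WSGI middleware in Python. This is an illustration under assumptions, not a drop-in config: `ALLOWED_HOSTS`, `host_allowlist` and `demo_app` are hypothetical names, and the domains are the placeholders from the Varnish rule. In WSGI the request's Host header is available as `environ["HTTP_HOST"]`.

```python
ALLOWED_HOSTS = {"example.com", "www.example.com"}  # placeholder domains


def host_allowlist(app):
    """Wrap a WSGI app and answer 403 when the Host header is not one of ours."""
    def middleware(environ, start_response):
        # HTTP_HOST may carry a port ("example.com:8080"); compare the name only.
        host = environ.get("HTTP_HOST", "").split(":")[0].lower()
        if host not in ALLOWED_HOSTS:
            body = b"Not allowed."
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    body = b"ok"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = host_allowlist(demo_app)
```

An exact-match set is used here instead of a regular expression so that a host such as `example.com.attacker.test` cannot slip through a loose pattern.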


I noticed problems with it, as well.


Maybe it's time to consider some other analytics tools.

