More

mythz · 2026-02-23T11:59:38 1771847978

Servo isn't a JS engine. Do you mean why didn't they abandon their mission statement of developing a truly independent browser engine from scratch, abandon their C++ code base they spent the last 5 years building, accept a regression hit on WPT test coverage, so they can start hacking on a completely different complex foreign code-base they have no experience in, that another team is already developing?

mythz · 2026-02-20T14:32:36 1771597956

I consider HuggingFace more "Open AI" than OpenAI - one of the few quiet heroes (along with Chinese OSS) helping bring on-premise AI to the masses.

I'm old enough to remember when traffic was expensive, so I've no idea how they've managed to offer free hosting for so many models. Hopefully it's backed by a sustainable business model, as the ecosystem would be meaningfully worse without them.

We still need good value hardware to run Kimi/GLM in-house, but at least we've got the weights and distribution sorted.

data-ottawa · 2026-02-20T14:46:24 1771598784

Can we toss in the work unsloth does too as an unsung hero?

They provide excellent documentation and they’re often very quick to get high quality quants up in major formats. They’re a very trustworthy brand.

disiplus · 2026-02-20T15:20:28 1771600828

Yeah, they're the good guys. I suspect the open source work is mostly advertisements for them to sell consulting and services to enterprises. Otherwise, the work they do doesn't make sense to offer for free.

danielhanchen · 2026-02-20T23:03:26 1771628606

Haha for now our primary goal is to expand the market for local AI and educate people on how to do RL, fine-tuning and running quants :)

WanderPanda · 2026-02-21T03:47:18 1771645638

Amazing work and people should really appreciate that the opportunity costs of your work are immense (given the hype).

On another note: I'm a bit paranoid about quantization. I know people are not good at discerning model quality at these levels of "intelligence" anymore, I don't think a vibe check really catches the nuances. How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

I was recently trying Qwen 3 Coder Next and there are benchmark numbers in your article but they seem to be for the official checkpoint, not the quantized ones. But it is not even really clear (and chatbots confuse them for benchmarks of the quantized versions btw.)

I think systematic/automated benchmarks would really bring the whole effort to the next level. Basically something like the bar chart from the Dynamic Quantization 2.0 article but always updated with all kinds of recent models.

danielhanchen · 2026-02-21T12:08:06 1771675686

Thanks! Yes we actually did think about that - it can get quite expensive sadly - perplexity benchmarks over short context lengths with small datasets are doable, but it's not an accurate measure sadly. We're actually investigating currently what would be the best efficient course of action on evaluating quants - will keep you posted!

jychang · 2026-02-21T11:29:48 1771673388

> How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

Very hard. $$$

The benchmarks are not cheap to run. It'll cost a lot to run them for each quant of each model.

danielhanchen · 2026-02-21T12:08:54 1771675734

Yes sadly very expensive :( Maybe a select few quants could happen - we're still figuring out what is the most economical and most efficient way to benchmark!

illusive4080 · 2026-02-21T12:55:42 1771678542

Roughly how much does it cost to run one of the popular benchmarks? Are we talking $1,000, $10,000, or $100k?

danielhanchen · 2026-02-22T09:58:44 1771754324

Oh it's more time that's the issue - each benchmark takes 1-3 hours ish to run on 8 GPUs, so running on all quants per model release can be quite painful.

Assume AWS spot say $20/hr B200 for 8 GPUs, then $20 ish per quant, so assuming benchmark is on BF16, 8bit, 6, 5, 4, 3, 2 bits then 7 ish tests so $140 per model ish to $420 ish/hr. Time wise 7 hours to 1 day ish.

We could run them after a model release which might work as well.

This is also on 1 benchmark.

Zetaphor · 2026-02-21T05:11:11 1771650671

This would be amazing

danielhanchen · 2026-02-21T12:09:01 1771675741

Working on it! :)

arcanemachiner · 2026-02-20T17:58:44 1771610324

I hope that is exactly what is happening. It benefits them, and it benefits us.

swyx · 2026-02-21T02:11:12 1771639872

not that unsung! we've given them our biggest workshop spot every single year we've been able to and will do until they are tired of us https://www.youtube.com/@aiDotEngineer/search?query=unsloth

danielhanchen · 2026-02-21T12:09:26 1771675766

Appreciate it immensely haha :) Never tired - always excited and pumped for this year!

danielhanchen · 2026-02-20T23:01:32 1771628492

Oh thank you - appreciate it :)

cubie · 2026-02-20T14:56:39 1771599399

I'm a big fan of their work as well, good shout.

danielhanchen · 2026-02-20T23:01:41 1771628501

Thank you!

Tepix · 2026-02-20T16:03:38 1771603418

It's insane how much traffic HF must be pushing out of the door. I routinely download models that are hundreds of gigabytes in size from them. A fantastic service to the sovererign AI community.

razster · 2026-02-20T18:47:40 1771613260

My fear is that these large "AI" companies will lobby to have these open source options removed or banned, growing concern. I'm not sure how else to explain how much I enjoy using what HF provides, I religiously browse their site for new and exciting models to try.

culi · 2026-02-20T19:19:20 1771615160

ModelScope is the Chinese equivalent of Hugging Face and a good back up. All the open models are Chinese anyways

thot_experiment · 2026-02-20T22:26:21 1771626381

Not true! Mistral is really really good, but I agree that there isn't a single decent open model from the USA.

culi · 2026-02-20T22:38:41 1771627121

Mistral is cool and I wish them success but it consistently ranks extremely low on benchmarks while still being expensive. Chinese models like DeepSeek might rank almost as low as Mistral but they are significantly cheaper. And Kimi is the best of both worlds with incredible benchmark results while still being incredibly cheap

I know things change rapidly so I'm not counting them out quite yet but I don't see them as a serious contender currently

thot_experiment · 2026-02-21T03:14:43 1771643683

Sure, benchmarks are fake and I use Mistral over equivalently sized models most of the time because it's better in real life. It runs plenty fast for me, I don't pay for inference.

BoredomIsFun · 2026-02-21T09:40:07 1771666807

> it consistently ranks extremely low on benchmarks

As general purpose chatbots small Mistral models are better than comparably sized Chiniese models, as they have better SimpleQA scores and general knowledge of Western culture.

seanmcdirmid · 2026-02-21T15:18:18 1771687098

It’s really hard to beat qwen coder, especially for role play where the instruction following is really useful. I don’t think their corpus is lacking in western knowledge, although I wonder if Chinese users get even better results from it?

BoredomIsFun · 2026-02-21T15:34:29 1771688069

> It’s really hard to beat qwen coder, for role play

I am not sure if you actually tried that. Mistrals are widely asccepted go-to models for roleplay and creative writing. No Qwens are good at prose, except for their latest big Qwen 3.5.

> I don’t think their corpus is lacking in western knowledge,

It absolutely does, especially pop culture knowledge.

seanmcdirmid · 2026-02-21T16:37:21 1771691841

Instruct and coder just follow instructions so well though. I guess I’ve just never been able to make mistral work well, I guess.

BoredomIsFun · 2026-02-21T17:44:40 1771695880

Qwen3 30B A3B and that big 400+ B Coder were absolutely terrible at editing fiction. I would tell them what to change in the prose and they'd just regurgitate text with no changes.

seanmcdirmid · 2026-02-22T01:03:19 1771722199

Did you try asking Gemini what model to use and how to configure/set it up? It has worked wonders for me, ironically (since I’m using a big model to setup smaller local models).

BoredomIsFun · 2026-02-22T07:04:39 1771743879

> Did you try asking Gemini what model to use and how to configure/set it up?

That would besuboptimal, as Gemini has too old knowledge cutoff. I am long past the need for such an advice anyway, as I've been using local models since mid 2024.

seanmcdirmid · 2026-02-22T11:34:12 1771760052

Gemini will search the web for most things (at least if you are using it via the web search interface), it isn’t limited to the knowledge it was trained on. Actually, I’m a bit mortified that not everyone knows this. If you ask Gemini (from the search interface) about a current event that happened yesterday, they will use search to pull in context and work with that. Also about model that was released yesterday, it can do that.

It’s only a very low level model access where search isn’t used. Local models also need to be configured to use search, and I haven't had a use case to do that yet.

Gemini seems to call this “grounding with google search”. If you have Gemini installed in your enterprise, it will also search internal data sources for context.

BoredomIsFun · 2026-02-22T19:22:03 1771788123

> Gemini will search the web for most things (at least if you are using it via the web search interface), it isn’t limited to the knowledge it was trained on.

If decides to do so, and even then baked in knowledge would influence the result.

In any case I do not need Gemini or any other LLMs to figure out setting for my llama.cpp, thank you very much.

seanmcdirmid · 2026-02-22T20:26:26 1771791986

It has always searched the web for me, and it can give me pretty good guidance about a model released in the last week. All models ATM are trying to reduce dependence on internal knowledge mostly through RAG. Anyways, this part of LLMs has gotten much better in the last 6 months.

If you are able to figure out the right settings for a model Thats was released last week, then great for you! But it sounds like you just don’t trust LLMs to use current knowledge, and have some misconception about how they satisfy recent knowledge requests.

Eupolemos · 2026-02-21T01:01:06 1771635666

Why are you talking price when we are talking local AI?

That doesn't make any sense to me. Am I missing something?

dirasieb · 2026-02-21T08:00:23 1771660823

15 missed calls from your local power company

culi · 2026-02-21T07:06:18 1771657578

Your electricity is free?

seanmcdirmid · 2026-02-21T15:18:57 1771687137

Apple silicon is crazy efficient as well as being comparable to GPUs in performance for max and ultra chips.

cpburns2009 · 2026-02-21T14:12:47 1771683167

If you have the hardware to run expensive models, is the cost of electricity much of a factor? According to Google, the average price in the Silicon Valley Area is $0.448 per kWh. An RTX 5090 costs about $4,000 and has a peak power consumption of 1000 W. Maxing out that GPU for a whole year would cost $3,925 at that rate. It's not particularly more expensive than that hardware itself.

culi · 2026-02-21T18:15:55 1771697755

At that point it'd be cheaper to get an expensive subscription to a cloud platform AI product. I understand the case for local LLMs but it seems silly to worry about pricing for cloud-based offerings but not worry about pricing for locally run models. Especially since running it locally can often be more expensive

thot_experiment · 2026-02-21T20:21:23 1771705283

for almost the entire year, yes.

ac29 · 2026-02-22T18:13:38 1771784018

Arcee is working on that, see a blog post about their newest in progress model here: https://www.arcee.ai/blog/trinity-large

Its still not fully post trained and its a non-reasoning model, but its worth keeping an eye on if you dont want to use the Chinese models that currently are the best open-weight options.

CamperBob2 · 2026-02-21T04:33:02 1771648382

To be fair there are lots of worse models than OpenAI's GPT-OSS-120b. It's not a standout when positioned next to the latest releases from China, but prior to the current wave it was considered one of the stronger local models you can reasonably run.

throwaway27448 · 2026-02-20T22:52:21 1771627941

They can try. I don't think they'll be able to get the toothpaste back in the tube. The data will just move our of the country.

seanmcdirmid · 2026-02-21T15:14:15 1771686855

Many of the models on hugging face are already Chinese. It’s kind of obvious that local AI is going to flourish more in China than the USA due to hardware constraints.

dotancohen · 2026-02-21T01:06:54 1771636014

How do you choose which models to try for which workflows? Do you have objective tests that you run, or do you just get a feel for them while using them in your daily workflow?

toofy · 2026-02-21T12:15:09 1771676109

it’s only a matter of time. we have all seen first hand how … wrong … these companies behave, almost on a regular basis.

there’s a small tinfoil hat part of me that suspects part of their obscene investments and cornering the hardware market is driven by an conscious attempt to stop open source local from taking off. they want it all, the money, the control, and to be the only source of information to us.

Onavo · 2026-02-20T19:31:40 1771615900

Bandwidth is not that expensive. The Big 3 clouds just want to milk customers via egress. Look at Hetzner or CloudFlare R2 if you want to get get an idea of commodity bandwidth costs.

vardalab · 2026-02-20T17:48:11 1771609691

Yup, I have downloaded probably a terabyte in the last week, especially with the Step 3.5 model being released and Minimax quants. I wonder what my ISP thinks. I hope they don't cut me off. They gave me a fast lane, they better let me use it, lol

fc417fc802 · 2026-02-20T20:01:09 1771617669

Even fairly restrictive data caps are in the range of 6 Tb per month. P2P at a mere 100 Mb works out to 1 TiB per 24 hours.

Hypothetically my ISP will sell me unmetered 10 Gb service but I wonder if they would actually make good on their word ...

3eb7988a1663 · 2026-02-21T01:40:37 1771638037

I have a 1.2TB cap before you start getting charged extra, so you might need to recalibrate your restrictive level.

fc417fc802 · 2026-02-21T02:08:05 1771639685

Is that with a WISP by chance? Or in a developing country? Or are there really wired providers with such low caps in the western world in this day and age?

Zetaphor · 2026-02-21T05:08:56 1771650536

ATT once told me if I don't pay for their TV service then my home gigabit fiber would have a 1TB cap. They had an agreement with the apartment building so I had no other choice of provider.

fc417fc802 · 2026-02-21T05:41:35 1771652495

Buy our off brand netflix or else we'll make it so you can't watch netflix. How is that legal?

Zetaphor · 2026-02-21T17:45:04 1771695904

The law is written by the highest bidder, and the telecom lobbyists are very generous

nagaiaida · 2026-02-21T03:58:34 1771646314

well it's my wired cap a stone's throw from buildings with google cloud logos on the side in a major us city, so...

zargon · 2026-02-21T04:27:48 1771648068

Comcast.

zozbot234 · 2026-02-20T14:40:26 1771598426

> We still need good value hardware to run Kimi/GLM in-house

If you stream weights in from SSD storage and freely use swap to extend your KV cache it will be really slow (multiple seconds per token!) but run on basically anything. And that's still really good for stuff that can be computed overnight, perhaps even by batching many requests simultaneously. It gets progressively better as you add more compute, of course.

Aurornis · 2026-02-20T18:32:16 1771612336

> it will be really slow (multiple seconds per token!)

This is fun for proving that it can be done, but that's 100X slower than hosted models and 1000X slower than GPT-Codex-Spark.

That's like going from real time conversation to e-mailing someone who only checks their inbox twice a day if you're lucky.

zozbot234 · 2026-02-21T12:10:22 1771675822

You'd need real rack-scale/datacenter infrastructure to properly match the hosted models that are keeping everything in fast VRAM at all times, and then you only get reasonable utilization on that by serving requests from many users. The ~100X slower tier is totally okay for experimentation and non-conversational use cases (including some that are more agentic-like!), and you'd reach ~10X (quite usable for conversation) by running something like a good homelab.

HPsquared · 2026-02-20T15:07:33 1771600053

At a certain point the energy starts to cost more than renting some GPUs.

vardalab · 2026-02-20T17:48:59 1771609739

Yeah, that is hard to argue with because I just go to OpenRouter and play around with a lot of models before I decide which ones I like. But there's something special about running it locally in your basement

dotancohen · 2026-02-21T01:09:53 1771636193

I'd love to hear more about this. How do you decide that you like a model? For which use cases?

fc417fc802 · 2026-02-20T20:04:54 1771617894

Aren't decent GPU boxes in excess of $5 per hour? At $0.20 per kWhr (which is on the high side in the US) running a 1 kW workstation 24/7 would work out to the same price as 1 hour of GPU time.

The issue you'll actually run into is that most residential housing isn't wired for more than ~2kW per room.

sowbug · 2026-02-20T15:27:04 1771601224

Why doesn't HF support BitTorrent? I know about hf-torrent and hf_transfer, but those aren't nearly as accessible as a link in the web UI.

embedding-shape · 2026-02-20T16:02:09 1771603329

> Why doesn't HF support BitTorrent?

Harder to track downloads then. Only when clients hit the tracker would they be able to get download states, and forget about private repositories or the "gated" ones that Meta/Facebook does for their "open" models.

Still, if vanity metrics wasn't so important, it'd be a great option. I've even thought of creating my own torrent mirror of HF to provide as a public service, as eventually access to models will be restricted, and it would be nice to be prepared for that moment a bit better.

sowbug · 2026-02-20T16:13:06 1771603986

I thought of the tracking and gate questions, too, when I vibed up an HF torrent service a few nights ago. (Super annoying BTW to have to download the files just to hash the parts, especially when webseeds exist.) Model owners could disable or gate torrents the same way they gate the models, and HF could still measure traffic by .torrent downloads and magnet clicks.

It's a bit like any legalization question -- the black market exists anyway, so a regulatory framework could bring at least some of it into the sunlight.

embedding-shape · 2026-02-20T16:19:01 1771604341

> Model owners could disable or gate torrents the same way they gate the models, and HF could still measure traffic by .torrent downloads and magnet clicks.

But that'll only stop a small part, anyone could share the infohash and if you're using the dht/magnet without .torrent files or clicks on a website, no one can count those downloads unless they too scrape the dht for peers who are reporting they've completed the download.

fc417fc802 · 2026-02-20T20:11:57 1771618317

> unless they too scrape the dht for peers who are reporting they've completed the download.

Which can be falsified. Head over to your favorite tracker and sort by completed downloads to see what I mean.

sowbug · 2026-02-20T16:25:36 1771604736

Right, but that's already happening today. That's the black-market point.

Barbing · 2026-02-21T19:20:15 1771701615

That would be a very nice service. I think folks might rely on it for a number of reasons, including that we'll want to see how biases changed over time. What got sloppier, shillier...

jimbob45 · 2026-02-20T19:38:05 1771616285

Wouldn’t it still provide massive benefits if they could convince/coerce their most popular downloaded models to move to torrenting?

intrasight · 2026-02-21T11:48:48 1771674528

Benefit to you, but great downside to the three letter agencies that inject their goods into these models.

homarp · 2026-02-20T18:00:30 1771610430

how are all the private trackers tracking ratios?

taminka · 2026-02-20T17:52:08 1771609928

most of the traffic is probably from open weights, just seed those, host private ones as is

Fin_Code · 2026-02-20T16:02:41 1771603361

I still don't know why they are not running on torrent. Its the perfect use case.

heliumtera · 2026-02-20T16:04:49 1771603489

How can you be the man in the middle in a truly P2P environment?

freedomben · 2026-02-20T16:05:17 1771603517

That would shut out most people working for big corp, which is probably a huge percentage of the user base. It's dumb, but that's just the way corp IT is (no torrenting allowed).

zozbot234 · 2026-02-20T16:08:51 1771603731

It's a sensible option, even when not everyone can really use it. Linux distros are routinely transfered via torrent, so why not other massive, open-licensed data?

freedomben · 2026-02-20T16:16:47 1771604207

Oh as an option, yeah I agree it makes a ton of sense. I just would expect a very, very small percentage of people to use the torrent over the direct download. With Linux distros, the vast majority of downloads still come from standard web servers. When I download distro images I opt for torrents, but very few people do the same

Const-me · 2026-02-20T21:03:30 1771621410

> very small percentage of people to use the torrent over the direct download

BitTorrent protocol is IMO better for downloading large files. When I want to download something which exceeds couple GB, and I see two links direct download and BitTorrent, I always click on the torrent.

On paper, HTTP supports range requests to resume partial downloads. IME, it seems modern web browsers neglected to implement it properly. They won’t resume after browser is reopened, or the computer is restarted. Command-line HTTP clients like wget are more reliable, however many web servers these days require some session cookies or one-time query string tokens, and it’s hard to pass that stuff from browser to command-line.

I live in Montenegro, CDN connectivity is not great here. Only a few of them like steam and GOG saturate my 300 megabit/sec download link. Others are much slower, e.g. windows updates download at about 100 megabit/sec. BitTorrent protocol almost always delivers the 300 megabit/sec bandwidth.

zrm · 2026-02-20T16:49:01 1771606141

With Linux distros they typically put the web link right on the main page and have a torrent available if you go look for it, because they want you to try their distro more than they want to save some bandwidth.

Suppose HF did the opposite because the bandwidth saved is more and they're not as concerned you might download a different model from someone else.

thot_experiment · 2026-02-20T22:30:40 1771626640

I have terabytes of linux isos I got via torrents, many such cases!

mythz · 2026-02-16T01:39:42 1771205982

Hikaru is either in a slump or his skill is starting to age: hasn't won Titled Tuesday since November, hasn't won Freestyle Friday this year, came last in Speed Chess Championship, etc.

We'll see how well he does in Candidates this year to see if he's still a top contender. Although I do believe this is his last chance to fight for the world title.

traes · 2026-02-16T02:25:35 1771208735

To be clear, "came last in Speed Chess Championship" actually means he came in 4th out of 16. He still made it to the semifinals. Even then he barely lost to Alireza, who is pretty universally considered a top 3 speed chess player. The loss to Lazavik was a lot worse, but it was still a close match against a strong player. He hasn't won a Titled Tuesday this year but he hasn't scored worse than 8/11 and he's still made the top 10. That's not as much of a slump as you imply IMO.

mythz · 2026-02-16T02:46:23 1771209983

Sure he's still one of the top players, but he's not as strong this year and OP is suggesting he still has an edge against the GOAT, who this year:

- Has won Freestyle WC

- Has won SCC

- Has won 2x Titled Tuesday's

- Has won a Freestyle Friday

Hikaru can snipe a win off Magnus here and there, but I don't think there's any time control or format where he could win a long series of chess matches against Magnus.

Jean-Papoulos · 2026-02-16T07:47:41 1771228061

He could win bullet. No increment means his years of streaming bullet will let an edge when moving in the endgame, so he just needs to draw out the game long enough to get Carlsen either to 0 or in trouble. Somehow we got a chess format where mechanics matter :)

traes · 2026-02-16T11:12:09 1771240329

His record in bullet in the Speed Chess Championship against Carlsen is rather unremarkable, although that is 1+1. Perhaps he would fair better at 1+0.

FreakLegion · 2026-02-16T02:47:40 1771210060

It isn't a slump at all, really. He had his first kid in December. He's preparing for the Candidates in March. Weekly chess.com tournaments are just, you know, going to be relegated to streaming content for a bit.

Pay08 · 2026-02-16T08:14:55 1771229695

Isn't Nakamura the best bullet chess player?

traes · 2026-02-16T10:54:29 1771239269

He's up there for sure, but not clearly the best. According to him both he and Magnus think Alireza Firouzja is the best in longer matches of multiple bullet games.[0] I suspect he would give the edge to Magnus in a shorter match, but I haven't found evidence for this.

[0]: https://www.youtube.com/watch?v=yKXV9-dTq1I&t=2674s

TheSilva · 2026-02-16T10:40:45 1771238445

Also he became a father[0] around that time. Everything adds.

[0] https://en.wikipedia.org/wiki/Hikaru_Nakamura#Personal_life

mythz · 2026-02-13T10:57:39 1770980259

BearSSL by Thomas Pornin is always worth checking in on, not sure what the current status is but looks like it received a commit last year.

[1] https://bearssl.org

jorams · 2026-02-13T11:12:13 1770981133

BearSSL is really cool, but it claims beta quality with the latest release in 2018, doesn't support TLS 1.3, and hasn't seen meaningful development in years. It's averaging about 1 commit per year recently, and they're not big ones.

embedding-shape · 2026-02-14T00:34:56 1771029296

Where is Bellard when we need him?

mananaysiempre · 2026-02-14T00:40:34 1771029634

Most relevantly here, selling a commercial implementation of ASN.1: https://bellard.org/ffasn1/.

mythz · 2026-02-12T17:15:49 1770916549

Really looked forward to this release as MiniMax M2.1 is currently my most used model thanks to it being fast, cheap and excellent at tool calling. Whilst I still use Antigravity + Claude for development, I reach for MiniMax first in my AI workflows, GLM for code tasks and Kimi K2.5 when deep English analysis is needed.

Not self-hosting yet, but I prefer using Chinese OSS models for AI workflows because of the potential to self-host in future if needed. Also using it to power my openclaw assistant since IMO it has the best balance of speed, quality and cost:

> It costs just $1 to run the model continuously for an hour at 100 tokens/sec. At 50 tokens/sec, the cost drops to $0.30.

algo_trader · 2026-02-12T18:47:42 1770922062

> MiniMax first in my AI workflows, GLM for code tasks and Kimi K2.5

Its good to have these models to keep the frontier labs honest! Can i ask if you use the API or a monthly plan? Do the monthly plan throttle/reset ?

edit: i agree that MM2.1 most economic, and K2.5 generally the strongest

mythz · 2026-02-12T19:01:15 1770922875

Using a coding plan, haven't noticed any throttling and very happy with the performance. They publish the quotas for each of their plans on their website [1]:

- $10/mo: 100 prompts / 5 hours

- $20/mo: 300 prompts / 5 hours

- $50/mo: 1000 prompts / 5 hours

[1] https://platform.minimax.io/docs/guides/pricing-coding-plan

miroljub · 2026-02-12T19:23:39 1770924219

They count one prompt as 15 requests. That gives you exactly 1500 API requests for 5 hours. Tokens are not counted.

user2722 · 2026-02-12T17:35:51 1770917751

!!!!!! Incredibly cheap!!!!!

I'll have to look for it in OpenRouter.

amunozo · 2026-02-12T17:49:26 1770918566

For the moment is free in Opencode, if you want ot try it.

mythz · 2026-02-11T14:46:36 1770821196

Lets not miss that MiniMax M2.5 [1] is also available today in their Chat UI [2].

I've got subs for both and whilst GLM is better at coding, I end up using MiniMax a lot more as my general purpose fast workhorse thanks to its speed and excellent tool calling support.

[1] https://news.ycombinator.com/item?id=46974878

[2] https://agent.minimax.io

andai · 2026-02-12T10:21:01 1770891661

My perspective aligns with this: I used to obsess over the Best Model, which I defined as "top of benchmarks", which also meant Biggest, Slowest and Most Expensive.

Then I gave two models a Real World Task.

The "Best" model took 3x longer to complete it, and cost 10x more. [0]

Now I define Best Model as "the smallest, fastest, cheapest one that can get the job done". (Currently happy with GLM-4.7 on Cerebras, at least I would be if the unlimited plan wasn't sold out ;)

I later expanded this principle when model speed crossed into the Interactive domain. Speed is not merely a feature; a sufficient difference in speed actually produces a completely new category of usage.

[0] We recently arrived at an approximation of AGI which is "put a lossy solver in an until-done loop". For most tasks we're throwing stuff at a wall to see what sticks, and the smaller models throw faster.

mythz · 2026-02-11T14:22:16 1770819736

We're getting high quality drops for the perfect trifecta of leading models from the Chinese models with the same day release of GLM-5 and Kimi K2.5 1T model drop a few days ago.

Despite having many great options, I end up making use of all 3. MiniMax is my fast workhorse for tool calling and getting quick responses. GLM for all coding tasks whilst Kimi K2.5 1T model has deep knowledge for everything else, with an Opus-level command of the english language. There's been many times where I've preferred Kimi K2.5 over Opus.

tosh · 2026-02-11T14:32:46 1770820366

I'm curious what is coming from Qwen as well. February is starting strong already.

mythz · 2026-02-11T14:40:56 1770820856

Yep the Qwen team has been churning out models for basically everything, and lets not sleep the big blue whale that started it all which is rumored to have a 1M context drop coming soon [1]

[1] https://www.reddit.com/r/LocalLLaMA/comments/1r1snhv/deepsee...

mythz · 2026-02-11T14:14:56 1770819296

They announced it on twitter [1]:

> A new model is now available on http://chat.z.ai.

Looks like that's all they can handle atm:

> User traffic has increased tenfold in a very short time. We’re currently scaling to handle the load.

[1] https://x.com/Zai_org/status/2021564343029203032

mythz · 2026-02-11T14:12:31 1770819151

It's looking like we'll have Chinese OSS to thank for being able to host our own intelligence, free from the whims of proprietary megacorps.

I know it doesn't make financial sense to self-host given how cheap OSS inference APIs are now, but it's comforting not being beholden to anyone or requiring a persistent internet connection for on-premise intelligence.

Didn't expect to go back to macOS but they're basically the only feasible consumer option for running large models locally.

btbuildem · 2026-02-11T14:46:51 1770821211

> doesn't make financial sense to self-host

I guess that's debatable. I regularly run out of quota on my claude max subscription. When that happens, I can sort of kind of get by with my modest setup (2x RTX3090) and quantized Qwen3.

And this does not even account for privacy and availability. I'm in Canada, and as the US is slowly consumed by its spiral of self-destruction, I fully expect at some point a digital iron curtain will go up. I think it's prudent to have alternatives, especially with these paradigm-shattering tools.

jsheard · 2026-02-11T14:58:34 1770821914

I think AI may be the only place you could get away with calling a 2x350W GPU rig "modest".

That's like ten normal computers worth of power for the GPUs alone.

bigyabai · 2026-02-11T17:45:46 1770831946

> That's like ten normal computers worth of power for the GPUs alone.

Maybe if your "computer" in question is a smartphone? Remember that the M3 Ultra is a 300w+ chip that won't beat one of those 3090s in compute or raster efficiency.

jsheard · 2026-02-11T17:54:00 1770832440

I wouldn't class the M3 Ultra as a "normal" computer either. That's a big-ass workstation. I was thinking along the lines of a typical Macbook or Mac Mini or Windows laptop, which are fine for 99% of anyone who isn't looking to play games or run gigantic AI models locally.

bigyabai · 2026-02-11T17:55:50 1770832550

Those aren't "normal" computers, either. They're iPad chips running in the TDP envelope of a tablet, usually with iPad-level performance to match.

dymk · 2026-02-11T17:35:51 1770831351

That's maybe a few dollars to tens of dollars in electricity per month depending on where in the US you live

GeorgeOldfield · 2026-02-12T07:36:56 1770881816

the upfront cost

kataklasm · 2026-02-11T15:50:41 1770825041

Did you even try to read and understand the parent comment? They said they regularly run out of quota on the exact subscription you're advising they subscribe to.

h3half · 2026-02-11T16:00:26 1770825626

Pot, kettle

wongarsu · 2026-02-11T15:19:29 1770823169

Self-hosting training (or gaming) makes a lot of sense, and once you have the hardware self-hosting inference on it is an easy step.

But if you have to factor in hardware costs self-hosting doesn't seem attractive. All the models I can self-host I can browse on openrouter and instantly get a provider who can get great prices. With most of the cost being in the GPUs themselves it just makes more sense to have others do it with better batching and GPU utilization

zozbot234 · 2026-02-11T15:22:37 1770823357

If you can get near 100% utilization for your own GPUs (i.e. you're letting requests run overnight and not insisting on any kind of realtime response) it starts to make sense. OpenRouter doesn't have any kind of batched requests API that would let you leverage that possibility.

spmurrayzzz · 2026-02-11T15:54:18 1770825258

For inference, even with continuous batching, getting 100% MFUs is basically impossible to do in practice. Even the frontier labs struggle with this in highly efficient infiniband clusters. Its slightly better with training workloads just due to all the batching and parallel compute, but still mostly unattainable with consumer rigs (you spend a lot of time waiting for I/O).

I also don't think the 100% util is necessary either, to be fair. I get a lot of value out of my two rigs (2x rtx pro 6000, and 4x 3090) even though it may not be 24/7 100% MFU. I'm always training, generating datasets, running agents, etc. I would never consider this a positive ROI measured against capex though, that's not really the point.

zozbot234 · 2026-02-11T16:01:39 1770825699

Isn't this just saying that your GPU use is bottlenecked by things such as VRAM bandwidth and RAM-VRAM transfers? That's normal and expected.

spmurrayzzz · 2026-02-11T19:52:18 1770839538

No I'm saying there are quite a few more bottlenecks than that (I/O being a big one). Even in the more efficient training frameworks, there's per-op dispatch overhead in python itself. All the boxing/unboxing of python objects to C++ handles, dispatcher lookup + setup, all the autograd bookkeeping, etc.

All of the bottlenecks in sum is why you'd never get to 100% MFUs (but I was conceding you probably don't need to in order to get value)

djsjajah · 2026-02-12T01:01:26 1770858086

That’s kind of a moot point. Even if none of those overheads existed you would still be getting a a fractions of the mfu. Models are fundamental limited by memory bandwidth even with best case scenarios of sft or prefill.

And what are you doing that I/O is a bottleneck?

spmurrayzzz · 2026-02-12T14:20:41 1770906041

> That’s kind of a moot point.

I don't believe it's moot, but I understand your point. The fact that models are memory bandwidth bound does not at all mean that other overhead is insignificant. Your practical delivered throughput is the minimum of compute ceiling, bandwidth ceiling, and all the unrelated speed limits you hit in the stack. Kernel launch latency, Python dispatch, framework bookkeeping, allocator churn, graph breaks, and sync points can all reduce effective speed. There are so many points in the training and inference loop where the model isn't even executing.

> And what are you doing that I/O is a bottleneck?

We do a fair amount of RLVR at my org. That's almost entirely waiting for servers/envs to do things, not the model doing prefill or decode (or even up/down weighting trajectories). The model is the cheap part in wall clock terms. The hard limits are in the verifier and environment pipeline. Spinning up sandboxes, running tests, reading and writing artifacts, and shuttling results through queues, these all create long idle gaps where the GPU is just waiting to do something.

zozbot234 · 2026-02-12T14:45:42 1770907542

> That's almost entirely waiting for servers/envs to do things

I'm not sure why, sandboxes/envs should be small and easy to scale horizontally to the point where your throughput is no longer limited by them, and the maximum latency involved should also be quite tiny (if adequately optimized). What am I missing?

spmurrayzzz · 2026-02-12T17:01:42 1770915702

First as an aside, remember that this entire thread is about using local compute. What you're alluding to is some fantasy infinite budget where you have limitless commodity compute. That's not at all the context of this thread.

But disregarding that, this isn't a problem you can solve by turning a knob akin to scaling a stateless k8s cluster.

The whole vertical of distributed RL has been struggling with this for a while. You can in theory just keep adding sandboxes in parallel, but in RLVR you are constrained by 1) the amount of rollout work you can do per gradient update, and 2) the verification and pruning pipeline that gates the reward signal.

You cant just arbitrarily have a large batch size for every rollout phase. Large batches often reduce effective diversity or get dominated by stragglers. And the outer loop is inherently sequential, because each gradient update depends on data generated by a particular policy snapshot. You can parallelize rollouts and the training step internally, but you can’t fully remove the policy-version dependency without drifting off-policy and taking on extra stability headaches.

sowbug · 2026-02-11T15:51:08 1770825068

In Silicon Valley we pay PG&E close to 50 cents per kWh. An RTX 6000 PC uses about 1 kW at full load, and renting such a machine from vast.ai costs 60 cents/hour as of this morning. It's very hard for heavy-load local AI to make sense here.

btbuildem · 2026-02-11T16:04:15 1770825855

Yikes.. I pay ~7¢ per kWh in Quebec. In the winter the inference rig doubles as a space heater for the office, I don't feel bad about running local energy-wise.

sheepscreek · 2026-02-12T03:52:40 1770868360

God bless Canada. I love our cheap hydro power. <3

Imustaskforhelp · 2026-02-11T15:58:31 1770825511

And you are forgetting the fact that things like vast.ai subscriptions would STILL be more expensive than Openrouter's api pricing and even more so in the case of AI subscriptions which actively LOSE money for the company.

So I would still point out the GP (Original comment) where yes, it might not make financial sense to run these AI Models [They make sense when you want privacy etc, which are all fair concerns but just not financial sense]

But the fact that these models are open source still means that they can be run when maybe in future the dynamics might shift and it might make sense running such large models locally. Even just giving this possibility and also the fact that multiple providers could now compete in say openrouter etc. as well. All facts included, definitely makes me appreciate GLM & Kimi compared to proprietory counterparts.

Edit: I highly recommend this video a lot https://www.youtube.com/watch?v=SmYNK0kqaDI [AI subscription vs H100]

This video is honestly one of the best in my opinion about this topic that I watched.

HumanOstrich · 2026-02-11T16:19:02 1770826742

Why did you quote yourself at the end of this comment?

Imustaskforhelp · 2026-02-11T17:04:58 1770829498

Oops sorry. Fixed it now but I am trying a HN progressive extension and what it does is if I have any text selected it can actually quote it and I think this is what might've happened or such a bug I am not sure.

It's fixed now :)

int_19h · 2026-02-12T03:34:23 1770867263

Anthropic has very tight limits, so you're basically using the worst (pricing-wise) SOTA cloud model as your baseline. I have $200 subs for both Claude and OpenAI, and I also bump into limits with Claude all the time, whether coding or research. With Codex, I ran into the limit once so far, and that's in a month of very heavy (sometimes literally 24 hours around the clock, leaving long-running tasks overnight) use.

sheepscreek · 2026-02-12T03:51:52 1770868312

I bought the Gemini Ultra to try for a month (at the discounted price). I have been using it non-stop for Opus 4.6 Thinking, which is much better than Gemini 3 Pro (High) and it's been a blast. The most I've managed to consume is 60% of my 5 hourly quota. That was with 2-3 instances in parallel.

I hope too many of us won't be doing this and cause Google to add limits! My hope is Google sees the benefit in this and goes all in - continues to let people decide which Google hosted model to use, including their own.

girvo · 2026-02-12T08:55:34 1770886534

How do you use Opus through Gemini Ultra? I must be missing something

astrod · 2026-02-12T11:12:30 1770894750

It's available in antigravity.

girvo · 2026-02-12T11:26:27 1770895587

Huh, fascinating. I'll check it out

doctoboggan · 2026-02-12T05:37:34 1770874654

Can you use the models you get through Gemini Ultra in Claude Code? If not, what coding tool do you use?

meeq · 2026-02-12T07:03:05 1770879785

Not OP, but I am pretty sure they are using Opencode with a certain antigravity plugin. Not going to link it, since it technically allows breaking TOS. If you‘re not using Opencode yet, I wholeheartedly recommend the switch.

btbuildem · 2026-02-12T16:14:56 1770912896

Getting CC to work with other models is quite straightforward -- setting a few env vars, and a thin proxy that rewrites the requests/responses to be in the expected format.

rnewme · 2026-02-12T07:58:08 1770883088

Claude code router

mythz · 2026-02-11T15:02:58 1770822178

Did the napkin math on M3 Ultra ROI when DeepSeek V3 launched: at $0.70/2M tokens and 30 tps, a $10K M3 Ultra would take ~30 years of non-stop inference to break even - without even factoring in electricity. Clearly people aren't self-hosting to save money.

I've got a lite GLM sub $72/yr which would require 138 years to burn through the $10K M3 Ultra sticker price. Even GLM's highest cost Max tier (20x lite) at $720/yr would buy you ~14 years.

ljosifov · 2026-02-11T15:54:01 1770825241

Everyone should do the calculation for themselves. I too pay for couple of subs. But I'm noticing having an agent work for me 24/7 changes the calculation somewhat. Often not taken into account: the price of input tokens. To produce 1K of code for me, the agent may need to churn through 1M of tokens of codebase. IDK if that will be cached by the API provider or not, but that makes x5-7 times price difference. OK discussion today about that and more https://x.com/alexocheema/status/2020626466522685499

wongarsu · 2026-02-11T15:31:23 1770823883

And it's worth noting that you can get DeepSeek at those prices from DeepSeek (Chinese), DeepInfra (US with Bulgarian founder), NovitaAI (US), AtlasCloud (US with Chinese founder), ParaSail (US), etc. There is no shortage of companies offering inference, with varying levels of trustworthiness, certificates and promises around (lack of) data retention. You just have to pick one you trust

DeathArrow · 2026-02-11T16:30:03 1770827403

I don't think an Apple PC can run full Deepseek or GLM models.

Even if you quantize the hell out of the models to fit in the memory, they will be very slow.

oceanplexian · 2026-02-11T15:53:53 1770825233

Doing inference with a Mac Mini to save money is more or less holding it wrong. Of course if you buy some overpriced Apple hardware it’s going to take years to break even.

Buy a couple real GPUs and do tensor parallelism and concurrent batch requests with vllm and it becomes extremely cost competitive to run your own hardware.

mythz · 2026-02-11T16:02:39 1770825759

> Doing inference with a Mac Mini to save money is more or less holding it wrong.

No one's running these large models on a Mac Mini.

> Of course if you buy some overpriced Apple hardware it’s going to take years to break even.

Great, where can I find cheaper hardware that can run GLM 5's 745B or Kimi K2.5 1T models? Currently it requires 2x M3 Ultras (1TB VRAM) to run Kimi K2.5 at 24 tok/s [1] What are the better value alternatives?

[1] https://x.com/alexocheema/status/2016404573917683754

Gracana · 2026-02-11T20:35:43 1770842143

Six months ago I'd have said EPYC Turin. You could do a heck of a build with 12Ch DDR5-6400 and a GPU or two for the dense model parts. 20k would have been a huge budget for a homelab CPU/GPU inference rig at the time. Now 20k won't buy you the memory.

mythz · 2026-02-12T01:03:53 1770858233

Not VRAM? What performance are people getting running GLM or Kimi on DDR5?

Gracana · 2026-02-12T18:06:49 1770919609

It's important to have enough VRAM to get the kv cache and shared trunk of the model on GPU, but beyond that it's really hard to make a dent in the pool of 100s of gigabytes of experts.

I wish I had better numbers to compare with the 2x M3 Ultra setup. My system is a few RTX A4000s on a Xeon with 190GB/s actual read bandwidth, and I get ~8 tok/s with experts quantized to INT4 (for large models with around 30B active parameters like Kimi K2.) Moving to 1x RTX Pro 6000 Blackwell and tripling my read bandwidth with EPYC Turin might make it competitive with the the macs, but I dunno!

There's also some interesting tech with ktransformers + sglang where the most frequently-used experts are loaded on GPU. Pretty neat stuff and it's all moving fast.

Gracana · 2026-02-14T19:41:09 1771098069

There's a reddit comment here https://www.reddit.com/r/LocalLLaMA/comments/1r4m4it/comment... that says:

my system is running GLM-5 MXFP4 at about 17 tok/s. That’s with a single RTX Pro 6000 on an EPYC 9455P with 12 channels of DDR5-6400. Only 16k context though, since it’s too slow to use for programming anyway and that’s the only application where I need big context.

Aurornis · 2026-02-11T16:06:17 1770825977

> I regularly run out of quota on my claude max subscription. When that happens, I can sort of kind of get by with my modest setup (2x RTX3090) and quantized Qwen3.

When talking about fallback from Claude plans, The correct financial comparison would be the same model hosted on OpenRouter.

You could buy a lot of tokens for the price of a pair of 3090s and a machine to run them.

bigyabai · 2026-02-11T17:50:45 1770832245

> You could buy a lot of tokens for the price of a pair of 3090s and a machine to run them.

That's a subjective opinion, to which the answer is "no you can't" for many people.

visarga · 2026-02-11T15:45:41 1770824741

Your $5,000 PC with 2 GPUs could have bought you 2 years of Claude Max, a model much more powerful and with longer context. In 2 years you could make that investment back in pay raise.

tw1984 · 2026-02-11T17:03:42 1770829422

> In 2 years you could make that investment back in pay raise.

you can't be a happy uber driver making more money in the next 24 months by having a fancy car fitted with the best FSD in town when all cars in your town have the same FSD.

visarga · 2026-02-11T17:30:09 1770831009

But they don't have the same human in the loop though.

tw1984 · 2026-02-11T17:43:09 1770831789

that software is called autonomous agents, the term autonomous has nothing to do with human in the loop, it is the complete opposite.

drdaeman · 2026-02-12T09:10:46 1770887446

Nothing changed since ’87. Machines still can’t be accountable and still shouldn’t make managerial decisions. Acceptance control is one of those decisions, and all the technical knowledge still matters to form a well-informed one. It may change, of course, but I have an impression that those who try otherwise seem to not fare well after the initial vibecoding honeymoon period. Of course, it varies from case to case - sometimes machines get things right, but long-term luck seems to eventually run out.

benterix · 2026-02-11T16:50:51 1770828651

> In 2 years you could make that investment back in pay raise.

Could you elaborate? I fail to grasp the implication here.

dymk · 2026-02-11T17:43:20 1770831800

This claim has so many assumptions mixed in it's utterly useless

7thpower · 2026-02-11T15:31:49 1770823909

Unless you already had those cards, it probably still doesn’t make sense from a purely financial perspective unless you have other things you’re discounting for.

Doesn’t mean you shouldn’t do it though.

flaviolivolsi · 2026-02-11T15:02:36 1770822156

How does your quantized Qwen3 compares in code quality to Opus?

Aurornis · 2026-02-11T15:29:52 1770823792

Not the person you’re responding to, but my experience with models up through Qwen3-coder-next is that they’re not even close.

They can do a lot of simple tasks in common frameworks well. Doing anything beyond basic work will just burn tokens for hours while you review and reject code.

btbuildem · 2026-02-11T16:15:26 1770826526

It's just as fast, but not nearly as clever. I can push the context size to 120k locally, but quality of the work it delivers starts to falter above say 40k. Generally you have to feed it more bite-sized pieces, and keep one chat to one topic. It's definitely a step down from SOTA.

fauigerzigerk · 2026-02-11T15:16:46 1770823006

>...free from the whims of proprietary megacorps

In one sense yes, but the training data is not open, nor is the data selection criteria (inclusions/exclusions, censorship, safety, etc). So we are still subject to the whims of someone much more powerful that ourselves.

The good thing is that open weights models can be finetuned to correct any biases that we may find.

muyuu · 2026-02-11T16:52:05 1770828725

you have 128GB strix halo machines for US$ ~3k

these run some pretty decent models locally, currently I'd recommend GPT-OSS 120GB, Qwen Coder Next 80B (either Q8 or Q6 quants, depending on speed/quality trade-offs) and the very best model you can run right now which is Step 3.5 Flash (ubergarm GGUF quant) with 256K context although this does push it to the limit - GLMs and nemotrons also worth trying depending on your priorities

there's clearly a big quantum leap in the SotA models using more than 512GB VRAM, but i expect that in a year or two, the current SotA is achievable with consumer level hardware, if nothing else hardware should catch up with running Kimi 2.5 for cheaper than 2x 512GB mac studio ultras - perhaps medusa halo next year supports 512GB and DDR5 comes down again, and that would put a local whatever the best open model of that size is next year within reach of under-US$5K hardware

the odd thing is that there isn't much in this whole range between 128GB and 512GB VRAM requirement to justify the huge premium you pay for Macs in that range - but this can change at any point as every other day there are announcements

saubeidl · 2026-02-11T16:55:32 1770828932

And you can get Strix Halo in a Laptop that looks and feels like a Macbook Pro that can run Linux if you buy an HP ZBook G1A.

Super happy with that thing, only real downside is battery life.

NiloCK · 2026-02-11T14:28:01 1770820081

> Didn't expect to go back to macOS but their basically the only feasible consumer option for running large models locally.

I presume here you are referring to running on the device in your lap.

How about a headless linux inference box in the closet / basement?

Return of the home network!

Aurornis · 2026-02-11T14:32:48 1770820368

Apple devices have high memory bandwidth necessary to run LLMs at reasonable rates.

It’s possible to build a Linux box that does the same but you’ll be spending a lot more to get there. With Apple, a $500 Mac Mini has memory bandwidth that you just can’t get anywhere else for the price.

cmrdporcupine · 2026-02-11T14:36:35 1770820595

But a $500 Mac Mini has nowhere near the memory capacity to run such a model. You'd need at least 2 512GB machines chained together to run this model. Maybe 1 if you quantized the crap out of it.

And Apple completely overcharges for memory, so.

This is a model you use via a cheap API provider like DeepInfra, or get on their coding plan. It's nice that it will be available as open weights, but not practical for mere mortals to run.

But I can see a large corporation that wants to avoid sending code offsite setting up their own private infra to host it.

zozbot234 · 2026-02-11T14:41:33 1770820893

The needed memory capacity depends on active parameters (not the same as total with a MoE model) and context length for the purpose of KV caching. Even then the KV cache can be pushed to system RAM and even farther out to swap, since writes to it are small (just one KV vector per token).

ingenieroariel · 2026-02-11T14:57:23 1770821843

With Apple devices you get very fast predictions once it gets going but it is inferior to nvidia precisely during prefetch (processing prompt/context) before it really gets going.

For our code assistant use cases the local inference on Macs will tend to favor workflows where there is a lot of generation and little reading and this is the opposite of how many of use use Claude Code.

Source: I started getting Mac Studios with max ram as soon as the first llama model was released.

Aurornis · 2026-02-11T15:09:25 1770822565

> With Apple devices you get very fast predictions once it gets going but it is inferior to nvidia precisely during prefetch (processing prompt/context) before it really gets going

I have a Mac and an nVidia build and I’m not disagreeing

But nobody is building a useful nVidia LLM box for the price of a $500 Mac Mini

You’re also not getting as much RAM as a Mac Studio unless you’re stacking multiple $8,000 nVidia RTX 6000s.

There is always something faster in LLM hardware. Apple is popular for the price points of average consumers.

kristianp · 2026-02-12T02:42:13 1770864133

Not many are getting useful inference out of a $500 mac mini, due to only having 16GB of RAM.

zozbot234 · 2026-02-12T03:11:59 1770865919

It depends. This particular model has larger experts with more active parameters so 16GB is likely not enough (at least not without further tricks) but there are much sparser models where an active expert can be in RAM while the weights for all other experts stay on disk. This becomes more and more of a necessity as models get sparser and RAM itself gets tighter. It lowers performance but the end result can still be "useful".

storus · 2026-02-11T15:03:50 1770822230

This. It's awful to wait 15 minutes for M3 Ultra to start generating tokens when your coding agent has 100k+ tokens in its context. This can be partially offset by adding DGX Spark to accelerate this phase. M5 Ultra should be like DGX Spark for prefill and M3 Ultra for token generation but who know when it will pop up and for how much? And it still will be at around 3080 GPU levels just with 512GB RAM.

zozbot234 · 2026-02-11T15:05:40 1770822340

All Apple devices have a NPU which is potentially able to save power for compute bound operations like prefill (at least if you're ok with FP16 FMA/INT8 MADD arithmetic). It's just a matter of hooking up support to the main local AI frameworks. This is not a speedup per se but gives you more headroom wrt. power and thermals for everything else, so should yield higher performance overall.

d3k · 2026-02-11T16:24:19 1770827059

AFAIK, only CoreML can use Apple's NPU (ANE). Pytorch, MLX and the other kids on the block use MPS (the GPU). I think the limitations you mentioned relate to that (but I might be missing something)

FuckButtons · 2026-02-11T15:56:06 1770825366

Vllm-mlx with prefix caching helps with this.

ac29 · 2026-02-11T16:18:51 1770826731

> a $500 Mac Mini has memory bandwidth that you just can’t get anywhere else for the price.

The cheapest new mac mini is $600 on Apple's US store.

And it has a 128-bit memory interface using LPDDR5X/7500, nothing exotic. The laptop I bought last year for <$500 has roughly the same memory speed and new machines are even faster.

jsheard · 2026-02-11T16:26:41 1770827201

> The cheapest new mac mini is $600 on Apple's US store.

And you're only getting 16GB at that base spec. It's $1000 for 32GB, or $2000 for 64GB plus the requisite SOC upgrade.

> And it has a 128-bit memory interface using LPDDR5X/7500, nothing exotic.

Yeah, 128-bit is table stakes and AMD is making 256-bit SOCs as well now. Apple's higher end Max/Ultra chips are the ones which stand out with their 512 and 1024-bit interfaces. Those have no direct competition.

zozbot234 · 2026-02-11T14:39:12 1770820752

And then only Apple devices have 512GB of unified memory, which matters when you have to combine larger models (even MoE) with the bigger context/KV caching you need for agentic workflows. You can make do with less, but only by slowing things down a whole lot.

pja · 2026-02-11T17:09:40 1770829780

Only the M4 Pro Mac Minis have faster RAM than you’ll get in an off-the-shelf Intel/AMD laptop. The M4 Pros start at $1399.

You want the M4 Max (or Ultra) in the Mac Studios to get the real stuff.

jannniii · 2026-02-11T14:42:55 1770820975

Indeed and I got two words for you:

Strix Halo

SillyUsername · 2026-02-11T16:00:45 1770825645

Also, cheaper... X99 + 8x DDR4 + 2696V4 + 4x Tesla P4s running on llama.cpp. Total cost about $500 including case and a 650W PSU, excluding RAM. Running TDP about 200W non peak 550W peak (everything slammed, but I've never seen it and I've an AC monitor on the socket). GLM 4.5 Air (60GB Q3-XL) when properly tuned runs at 8.5 to 10 tokens / second, with context size of 8K. Throw in a P100 too and you'll see 11-12.5 t/s (still tuning this one). Performance doesn't drop as much for larger model sizes as the internode communication and DDR4 2400 is the limiter, not the GPUs. I've been using this with 4 channel 96GB ram, recently updated to 128GB.

Aurornis · 2026-02-11T16:08:47 1770826127

> Also, cheaper... X99 + 8x DDR4 + 2696V4 + 4x Tesla P4s running on llama.cpp. Total cost about $500 including case and a 650W PSU, excluding RAM.

Excluding RAM in your pricing is misleading right now.

That’s a lot of work and money just to get 10 tokens/sec

esafak · 2026-02-11T15:35:34 1770824134

How much memory does yours have, what are you running on it, with what cache size, and how fast?

mythz · 2026-02-11T14:54:54 1770821694

Not feasible for Large models, it takes 2x M3 512GB Ultra's to run the full Kimi K2.5 model at a respectable 24 tok/s. Hopefully the M5 Ultra will can improve on that.

vidarh · 2026-02-11T15:14:09 1770822849

I don't really care about being able to self host these models, but getting to a point where the hosting is commoditised so I know I can switch providers on a whim matters a great deal.

Of course, it's nice if I can run it myself as a last resort too.

RALaBarge · 2026-02-12T01:36:51 1770860211

It is pretty easy to set up Open Router and set up schemes to point at different models, but in the same token, you can point at yours locally unless you wanted a "more powerful" answer

nialv7 · 2026-02-11T14:54:37 1770821677

> Didn't expect to go back to macOS but their basically the only feasible consumer option for running large models locally.

Framework Desktop! Half the memory bandwidth of M4 Max, but much cheaper.

thebruce87m · 2026-02-11T15:38:28 1770824308

Does that equate to half the speed in terms of output? Any recommended benchmarks to look at?

nialv7 · 2026-02-11T15:57:50 1770825470

https://kyuz0.github.io/amd-strix-halo-toolboxes/

mikrl · 2026-02-11T14:21:43 1770819703

>I know it doesn't make financial sense to self-host given how cheap OSS inference APIs are now

You can calculate the exact cost of home inference, given you know your hardware and can measure electrical consumption and compare it to your bill.

I have no idea what cloud inference in aggregate actually costs, whether it’s profitable or a VC infused loss leader that will spike in price later.

That’s why I’m using cloud inference now to build out my local stack.

mythz · 2026-02-11T14:35:45 1770820545

Not concerned with electricity cost - I have solar + battery with excess supply where most goes back to the grid for $0 compensation (AU special).

But I did the napkin math on M3 Ultra ROI when DeepSeek V3 launched: at $0.70/2M tokens and 30 tps, a $10K M3 Ultra would take ~30 years of non-stop inference to break even - without even factoring in electricity. You clearly don't self-host to save money. You do it to own your intelligence, keep your privacy, and not be reliant on a persistent internet connection.

gz5 · 2026-02-11T14:46:44 1770821204

hopefully it will spread - many open options, from many entities, globally.

it is brilliant business strategy from China so i expect it to continue and be copied - good things.

reminds me of Google's investments into K8s.

andersa · 2026-02-11T15:01:52 1770822112

They haven't published the weights yet, don't celebrate too early.

andersa · 2026-02-11T19:50:23 1770839423

Now they have!