
I consider HuggingFace more "Open AI" than OpenAI - one of the few quiet heroes (along with Chinese OSS) helping bring on-premise AI to the masses.

I'm old enough to remember when traffic was expensive, so I've no idea how they've managed to offer free hosting for so many models. Hopefully it's backed by a sustainable business model, as the ecosystem would be meaningfully worse without them.

We still need good value hardware to run Kimi/GLM in-house, but at least we've got the weights and distribution sorted.



Can we toss in the work unsloth does too as an unsung hero?

They provide excellent documentation and they’re often very quick to get high quality quants up in major formats. They’re a very trustworthy brand.


Yeah, they're the good guys. I suspect the open source work is mostly advertisements for them to sell consulting and services to enterprises. Otherwise, the work they do doesn't make sense to offer for free.


Haha for now our primary goal is to expand the market for local AI and educate people on how to do RL, fine-tuning and running quants :)


Amazing work and people should really appreciate that the opportunity costs of your work are immense (given the hype).

On another note: I'm a bit paranoid about quantization. People are no longer good at discerning model quality at these levels of "intelligence", and I don't think a vibe check really catches the nuances. How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

I was recently trying Qwen 3 Coder Next, and there are benchmark numbers in your article, but they seem to be for the official checkpoint, not the quantized ones. It's not really made clear, though (and chatbots mistake them for benchmarks of the quantized versions, btw).

I think systematic/automated benchmarks would really bring the whole effort to the next level. Basically something like the bar chart from the Dynamic Quantization 2.0 article but always updated with all kinds of recent models.


Thanks! Yes, we actually did think about that - it can get quite expensive sadly. Perplexity benchmarks over short context lengths with small datasets are doable, but they're not an accurate measure. We're currently investigating the most efficient course of action for evaluating quants - will keep you posted!


> How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

Very hard. $$$

The benchmarks are not cheap to run. It'll cost a lot to run them for each quant of each model.


Yes, sadly very expensive :( Maybe a select few quants could happen - we're still figuring out the most economical and efficient way to benchmark!


Roughly how much does it cost to run one of the popular benchmarks? Are we talking $1,000, $10,000, or $100k?


Oh it's more time that's the issue - each benchmark takes 1-3 hours ish to run on 8 GPUs, so running on all quants per model release can be quite painful.

Assume AWS spot pricing of say $20/hr for 8 B200 GPUs, so $20-60 ish per quant. Assuming the benchmark is run on BF16, 8-bit, 6-bit, 5-bit, 4-bit, 3-bit and 2-bit, that's 7 ish tests, so $140 to $420 ish per model per benchmark. Time-wise, 7 hours to 1 day ish.

We could run them after a model release which might work as well.

This is also on 1 benchmark.
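Back-of-the-envelope, that estimate can be sketched in a few lines (the hourly rate and run times are the rough figures from this comment, not quoted prices):

```python
# Rough cost model for benchmarking one model's quants.
# All numbers are the ballpark assumptions from the comment above.
HOURLY_RATE = 20.0      # assumed AWS spot price for 8x B200, $/hr
HOURS_PER_RUN = (1, 3)  # each benchmark run takes roughly 1-3 hours
QUANTS = ["BF16", "8-bit", "6-bit", "5-bit", "4-bit", "3-bit", "2-bit"]

low = HOURLY_RATE * HOURS_PER_RUN[0] * len(QUANTS)
high = HOURLY_RATE * HOURS_PER_RUN[1] * len(QUANTS)

print(f"{len(QUANTS)} quants, one benchmark: ${low:.0f} to ${high:.0f}")
print(f"Wall-clock if run serially: {HOURS_PER_RUN[0] * len(QUANTS)} "
      f"to {HOURS_PER_RUN[1] * len(QUANTS)} hours")
```

which lands at $140-$420 and 7-21 hours per model, per benchmark - consistent with "7 hours to 1 day ish" above.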


This would be amazing


Working on it! :)


I hope that is exactly what is happening. It benefits them, and it benefits us.


not that unsung! we've given them our biggest workshop spot every single year we've been able to, and will do so until they are tired of us https://www.youtube.com/@aiDotEngineer/search?query=unsloth


Appreciate it immensely haha :) Never tired - always excited and pumped for this year!


Oh thank you - appreciate it :)


I'm a big fan of their work as well, good shout.


Thank you!


It's insane how much traffic HF must be pushing out of the door. I routinely download models that are hundreds of gigabytes in size from them. A fantastic service to the sovereign AI community.


My fear is that these large "AI" companies will lobby to have these open source options removed or banned - a growing concern. I'm not sure how else to express how much I enjoy using what HF provides; I religiously browse their site for new and exciting models to try.


ModelScope is the Chinese equivalent of Hugging Face and a good backup. All the open models are Chinese anyway.


Not true! Mistral is really really good, but I agree that there isn't a single decent open model from the USA.


Mistral is cool and I wish them success but it consistently ranks extremely low on benchmarks while still being expensive. Chinese models like DeepSeek might rank almost as low as Mistral but they are significantly cheaper. And Kimi is the best of both worlds with incredible benchmark results while still being incredibly cheap

I know things change rapidly so I'm not counting them out quite yet but I don't see them as a serious contender currently


Sure, benchmarks are fake and I use Mistral over equivalently sized models most of the time because it's better in real life. It runs plenty fast for me, I don't pay for inference.


> it consistently ranks extremely low on benchmarks

As general purpose chatbots, small Mistral models are better than comparably sized Chinese models, as they have better SimpleQA scores and general knowledge of Western culture.


It’s really hard to beat qwen coder, especially for role play where the instruction following is really useful. I don’t think their corpus is lacking in western knowledge, although I wonder if Chinese users get even better results from it?


> It’s really hard to beat qwen coder, for role play

I am not sure if you actually tried that. Mistrals are the widely accepted go-to models for roleplay and creative writing. No Qwens are good at prose, except for their latest big Qwen 3.5.

> I don’t think their corpus is lacking in western knowledge,

It absolutely is, especially pop culture knowledge.


Instruct and coder just follow instructions so well, though. I guess I've just never been able to make Mistral work well.


Qwen3 30B A3B and that big 400+ B Coder were absolutely terrible at editing fiction. I would tell them what to change in the prose and they'd just regurgitate text with no changes.


Did you try asking Gemini what model to use and how to configure and set it up? It has worked wonders for me, ironically (since I'm using a big model to set up smaller local models).


> Did you try asking Gemini what model to use and how to configure/set it up?

That would be suboptimal, as Gemini's knowledge cutoff is too old. I'm long past needing such advice anyway, as I've been using local models since mid-2024.


Gemini will search the web for most things (at least if you're using it via the web search interface); it isn't limited to the knowledge it was trained on. Actually, I'm a bit mortified that not everyone knows this. If you ask Gemini (from the search interface) about a current event that happened yesterday, it will use search to pull in context and work with that. It can do the same for a model that was released yesterday.

It's only at very low-level model access that search isn't used. Local models also need to be configured to use search, and I haven't had a use case for that yet.

Gemini seems to call this “grounding with google search”. If you have Gemini installed in your enterprise, it will also search internal data sources for context.


> Gemini will search the web for most things (at least if you are using it via the web search interface), it isn’t limited to the knowledge it was trained on.

If it decides to do so - and even then, the baked-in knowledge would influence the result.

In any case I do not need Gemini or any other LLMs to figure out setting for my llama.cpp, thank you very much.


It has always searched the web for me, and it can give me pretty good guidance about a model released in the last week. All models ATM are trying to reduce dependence on internal knowledge, mostly through RAG. Anyway, this part of LLMs has gotten much better in the last 6 months.

If you're able to figure out the right settings for a model that was released last week, then great for you! But it sounds like you just don't trust LLMs to use current knowledge, and have some misconceptions about how they satisfy recent-knowledge requests.


Why are you talking price when we are talking local AI?

That doesn't make any sense to me. Am I missing something?


15 missed calls from your local power company


Your electricity is free?


Apple silicon is crazy efficient as well as being comparable to GPUs in performance for max and ultra chips.


If you have the hardware to run expensive models, is the cost of electricity much of a factor? According to Google, the average price in the Silicon Valley area is $0.448 per kWh. An RTX 5090 costs about $4,000 and has a peak power consumption of 1000 W. Maxing out that GPU for a whole year would cost $3,925 at that rate - not particularly more expensive than the hardware itself.
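The year-of-electricity figure checks out (the rate and wattage are the comment's assumptions, not measurements):

```python
# Back-of-the-envelope annual electricity cost for a maxed-out GPU.
# Rate and wattage are the assumptions from the comment above.
price_per_kwh = 0.448      # assumed Silicon Valley average, $/kWh
gpu_watts = 1000           # assumed peak draw
hours_per_year = 24 * 365

kwh = gpu_watts / 1000 * hours_per_year   # 8,760 kWh/year
cost = kwh * price_per_kwh
print(f"{kwh:.0f} kWh/year -> ${cost:,.2f}")   # about $3,925/year
```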


At that point it'd be cheaper to get an expensive subscription to a cloud platform AI product. I understand the case for local LLMs but it seems silly to worry about pricing for cloud-based offerings but not worry about pricing for locally run models. Especially since running it locally can often be more expensive


for almost the entire year, yes.


Arcee is working on that, see a blog post about their newest in progress model here: https://www.arcee.ai/blog/trinity-large

It's still not fully post-trained and it's a non-reasoning model, but it's worth keeping an eye on if you don't want to use the Chinese models that are currently the best open-weight options.


To be fair, there are lots of worse models than OpenAI's GPT-OSS-120b. It's not a standout next to the latest releases from China, but prior to the current wave it was considered one of the stronger local models you could reasonably run.


They can try. I don't think they'll be able to get the toothpaste back in the tube. The data will just move out of the country.


Many of the models on hugging face are already Chinese. It’s kind of obvious that local AI is going to flourish more in China than the USA due to hardware constraints.


How do you choose which models to try for which workflows? Do you have objective tests that you run, or do you just get a feel for them while using them in your daily workflow?


it’s only a matter of time. we have all seen first hand how … wrong … these companies behave, almost on a regular basis.

there's a small tinfoil hat part of me that suspects part of their obscene investments and cornering of the hardware market is driven by a conscious attempt to stop open source local AI from taking off. they want it all: the money, the control, and to be the only source of information for us.


Bandwidth is not that expensive. The Big 3 clouds just want to milk customers via egress. Look at Hetzner or Cloudflare R2 if you want to get an idea of commodity bandwidth costs.


Yup, I have downloaded probably a terabyte in the last week, especially with the Step 3.5 model being released and Minimax quants. I wonder what my ISP thinks. I hope they don't cut me off. They gave me a fast lane, they better let me use it, lol


Even fairly restrictive data caps are in the range of 6 TB per month. P2P at a mere 100 Mb/s works out to about 1 TiB per 24 hours.
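The 100 Mb/s figure is easy to verify:

```python
# Sustained 100 Mb/s download for one day, in decimal TB and binary TiB.
mbit_per_s = 100
seconds_per_day = 24 * 60 * 60

bytes_per_day = mbit_per_s * 1e6 / 8 * seconds_per_day
tb = bytes_per_day / 1e12    # decimal terabytes
tib = bytes_per_day / 2**40  # binary tebibytes
print(f"{tb:.2f} TB/day ({tib:.2f} TiB/day)")  # 1.08 TB/day, ~0.98 TiB/day
```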

Hypothetically my ISP will sell me unmetered 10 Gb service but I wonder if they would actually make good on their word ...


I have a 1.2TB cap before you start getting charged extra, so you might need to recalibrate your restrictive level.


Is that with a WISP by chance? Or in a developing country? Or are there really wired providers with such low caps in the western world in this day and age?


AT&T once told me that if I didn't pay for their TV service, my home gigabit fiber would have a 1 TB cap. They had an agreement with the apartment building, so I had no other choice of provider.


Buy our off brand netflix or else we'll make it so you can't watch netflix. How is that legal?


The law is written by the highest bidder, and the telecom lobbyists are very generous


well it's my wired cap a stone's throw from buildings with google cloud logos on the side in a major us city, so...


Comcast.


> We still need good value hardware to run Kimi/GLM in-house

If you stream weights in from SSD storage and freely use swap to extend your KV cache, it will be really slow (multiple seconds per token!) but will run on basically anything. And that's still really good for stuff that can be computed overnight, perhaps even by batching many requests simultaneously. It gets progressively better as you add more compute, of course.
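A rough sketch of why SSD streaming lands in the seconds-per-token range: with a MoE model, each token only touches the active experts, so a lower bound is roughly active-parameter bytes divided by SSD read bandwidth. All numbers here are illustrative assumptions, not measurements of any particular model or machine:

```python
# Illustrative lower bound on per-token latency when streaming MoE
# weights from SSD. Every figure below is an assumption for the
# sake of the estimate, not a benchmark.
active_params = 32e9   # assumed active params per token for a large MoE
bytes_per_param = 1.0  # assumed 8-bit quantization
ssd_bandwidth = 7e9    # assumed fast NVMe sequential read, bytes/s

bytes_per_token = active_params * bytes_per_param
seconds_per_token = bytes_per_token / ssd_bandwidth
print(f"~{seconds_per_token:.1f} s/token")  # multiple seconds per token
```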


> it will be really slow (multiple seconds per token!)

This is fun for proving that it can be done, but that's 100X slower than hosted models and 1000X slower than GPT-Codex-Spark.

That's like going from real time conversation to e-mailing someone who only checks their inbox twice a day if you're lucky.


You'd need real rack-scale/datacenter infrastructure to properly match the hosted models that are keeping everything in fast VRAM at all times, and then you only get reasonable utilization on that by serving requests from many users. The ~100X slower tier is totally okay for experimentation and non-conversational use cases (including some that are more agentic-like!), and you'd reach ~10X (quite usable for conversation) by running something like a good homelab.


At a certain point the energy starts to cost more than renting some GPUs.


Yeah, that is hard to argue with because I just go to OpenRouter and play around with a lot of models before I decide which ones I like. But there's something special about running it locally in your basement


I'd love to hear more about this. How do you decide that you like a model? For which use cases?


Aren't decent GPU boxes in excess of $5 per hour? At $0.20 per kWh (which is on the high side in the US), running a 1 kW workstation 24/7 works out to about the same price per day as 1 hour of GPU time.

The issue you'll actually run into is that most residential housing isn't wired for more than ~2kW per room.


Why doesn't HF support BitTorrent? I know about hf-torrent and hf_transfer, but those aren't nearly as accessible as a link in the web UI.


> Why doesn't HF support BitTorrent?

Harder to track downloads then. Only when clients hit the tracker would they be able to get download stats - and forget about private repositories or the "gated" ones that Meta/Facebook does for their "open" models.

Still, if vanity metrics weren't so important, it'd be a great option. I've even thought of creating my own torrent mirror of HF as a public service, since eventually access to models will be restricted, and it would be nice to be a bit better prepared for that moment.


I thought of the tracking and gating questions, too, when I vibed up an HF torrent service a few nights ago. (Super annoying, BTW, to have to download the files just to hash the pieces, especially when webseeds exist.) Model owners could disable or gate torrents the same way they gate the models, and HF could still measure traffic by .torrent downloads and magnet clicks.

It's a bit like any legalization question -- the black market exists anyway, so a regulatory framework could bring at least some of it into the sunlight.
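For illustration, a magnet link with a webseed (the `ws` parameter, BEP 19 style) is just string assembly - a client that supports webseeds falls back to the HTTP mirror when few peers are online. The infohash and URL below are made-up placeholders, not a real model repo:

```python
from urllib.parse import quote

def magnet_with_webseed(infohash: str, name: str, webseed_url: str) -> str:
    """Build a magnet URI whose 'ws' parameter points at an HTTP mirror.
    All inputs in this sketch are hypothetical examples."""
    return (
        f"magnet:?xt=urn:btih:{infohash}"
        f"&dn={quote(name)}"
        f"&ws={quote(webseed_url, safe='')}"
    )

# Hypothetical values for demonstration only:
link = magnet_with_webseed(
    "0123456789abcdef0123456789abcdef01234567",
    "model-Q4_K_M.gguf",
    "https://example.org/models/model-Q4_K_M.gguf",
)
print(link)
```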


> Model owners could disable or gate torrents the same way they gate the models, and HF could still measure traffic by .torrent downloads and magnet clicks.

But that would only stop a small part of it: anyone can share the infohash, and if you're using DHT/magnet links without .torrent files or clicks on a website, no one can count those downloads unless they also scrape the DHT for peers reporting that they've completed the download.


> unless they too scrape the dht for peers who are reporting they've completed the download.

Which can be falsified. Head over to your favorite tracker and sort by completed downloads to see what I mean.


Right, but that's already happening today. That's the black-market point.


That would be a very nice service. I think folks might rely on it for a number of reasons, including that we'll want to see how biases changed over time. What got sloppier, shillier...


Wouldn’t it still provide massive benefits if they could convince/coerce their most popular downloaded models to move to torrenting?


Benefit to you, but great downside to the three letter agencies that inject their goods into these models.


how are all the private trackers tracking ratios?


most of the traffic is probably from open weights, just seed those, host private ones as is


I still don't know why they aren't running on torrents. It's the perfect use case.


How can you be the man in the middle in a truly P2P environment?


That would shut out most people working for big corp, which is probably a huge percentage of the user base. It's dumb, but that's just the way corp IT is (no torrenting allowed).


It's a sensible option, even if not everyone can use it. Linux distros are routinely transferred via torrent, so why not other massive, open-licensed data?


Oh as an option, yeah I agree it makes a ton of sense. I just would expect a very, very small percentage of people to use the torrent over the direct download. With Linux distros, the vast majority of downloads still come from standard web servers. When I download distro images I opt for torrents, but very few people do the same


> very small percentage of people to use the torrent over the direct download

The BitTorrent protocol is IMO better for downloading large files. When I want to download something that exceeds a couple GB and I see two links, direct download and BitTorrent, I always click the torrent.

On paper, HTTP supports range requests to resume partial downloads. IME, modern web browsers haven't implemented this properly: they won't resume after the browser is reopened or the computer is restarted. Command-line HTTP clients like wget are more reliable, but many web servers these days require session cookies or one-time query-string tokens, and it's hard to pass that stuff from the browser to the command line.
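The resume mechanics boil down to a `Range` request header and a `Content-Range` response header; a minimal sketch (the parsing is simplified and assumes a well-formed server response):

```python
import re

def resume_header(bytes_on_disk: int) -> dict:
    """Ask the server to send everything from the current file size on.
    A server that supports ranges replies 206 Partial Content."""
    return {"Range": f"bytes={bytes_on_disk}-"}

def parse_content_range(value: str) -> tuple:
    """Parse 'bytes START-END/TOTAL' from a 206 response header."""
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value)
    if not m:
        raise ValueError(f"unexpected Content-Range: {value!r}")
    return tuple(int(g) for g in m.groups())

print(resume_header(1_048_576))   # {'Range': 'bytes=1048576-'}
start, end, total = parse_content_range("bytes 1048576-10485759/10485760")
print(start, end, total)
```

A real client would open the local file in append mode and verify the server actually returned 206 (a 200 means it ignored the range and restarted from zero).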

I live in Montenegro, where CDN connectivity is not great. Only a few CDNs, like Steam's and GOG's, saturate my 300 megabit/sec download link; others are much slower - e.g. Windows updates download at about 100 megabit/sec. The BitTorrent protocol almost always delivers the full 300 megabit/sec.


With Linux distros, they typically put the web link right on the main page and have a torrent available if you go looking for it, because they want you to try their distro more than they want to save some bandwidth.

I suppose HF would do the opposite, because the bandwidth saved is greater and they're not as concerned that you might download a different model from someone else.


I have terabytes of linux isos I got via torrents, many such cases!



