
beernet · 2026-05-22T15:59:29 1779465569

> The reporting doesn't mention it, because it doesn't fit the narrative, but does anyone want to guess how many human drivers got suddenly stuck in the flood?

This is the important point here. Human beings are highly forgiving towards other human beings, but not so much towards machines. At the same time, expectations of machines are much higher.

Tells you more about humans than machines.

flextheruler · 2026-05-23T14:32:29 1779546749

Comparing individual drivers to a taxi company is disingenuous and silly.

"How many taxi drivers got stuck?" is the actual comparison.

This is a paid taxi service. If I got an undercooked meal from a restaurant and their defense was that more people accidentally undercook their food at home than the restaurant does, I don't think I'd be accommodating.

DannyBee · 2026-05-24T04:07:43 1779595663

> "How many taxi drivers got stuck?" is the actual comparison.

I would bet a lot of money the number here is also >1.

beernet · 2026-05-22T15:57:01 1779465421

Yep, the great theoretical promise of local models remains theoretical, no matter how much die-hard engineers want to push it... Who would have thought, right?

beernet · 2026-05-22T15:44:14 1779464654

Only on HN will people doubt the moat of a company with >5T market cap at an annualized revenue of 400B with YoY growth close to 100%. Yes, bubbles gonna bubble, but what?

0xDEAFBEAD · 2026-05-23T03:55:45 1779508545

Annualized revenue and YoY growth have little to do with long-term moat width

beernet · 2026-05-02T12:43:59 1777725839

Meh. The Granite models have always been at least one year behind in terms of capabilities. Feels like they are forced just to satisfy shareholders, on paper at least.

beernet · 2026-04-28T18:24:49 1777400689

I am much more surprised by the actual uptime than by the downtime. Hard to imagine how difficult this must be, given the speed of growth.

nippoo · 2026-04-28T18:34:13 1777401253

Truly! As someone who's worked with HPC and GPUs in a scientific research context, trying to get a service like this to work reliably is a different ballgame to your usual webapp stack...

lostlogin · 2026-04-28T18:42:26 1777401746

But… imagine that same scientific research but you have an unlimited budget. I’d imagine that helps.

Some of the comments here mention their monthly spend, and it’s eye-watering.

handoflixue · 2026-04-28T21:44:35 1777412675

It would be "unlimited budget" if they were a monopoly, but they're in a bidding war with three other "unlimited" budget AI companies, over a resource no one expected to be scarce. There's simply not enough supply to meet demand, no matter how much money you have

rvnx · 2026-04-28T19:45:10 1777405510

I think you have to see this as a bunch of stateless requests, and this makes the problem way easier.

  LLM requests that do not call tools do not need anything external by definition.
  No central server, nothing, they can even survive without the context cache.
  All you need is to load (and only once!) the read-only immutable model weights from an S3-like source on startup.

  If it takes 4 servers to process a request, then you can group them 4 by 4, and then send a request to each group (sharding).

  Copy-paste the exact same setup XXX times and there you have your highly parallelizable service (until you run out of money).
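A minimal sketch of the "group them 4 by 4" idea above, assuming a plain round-robin dispatcher. The names `make_groups`, `RoundRobinDispatcher`, and the `gpu-N` server labels are hypothetical and only illustrate the shape of the scheme, not how any real provider does it.

```python
from itertools import cycle


def make_groups(servers, group_size):
    """Partition servers into fixed-size groups; leftover servers stay unused."""
    usable = len(servers) - len(servers) % group_size
    return [servers[i:i + group_size] for i in range(0, usable, group_size)]


class RoundRobinDispatcher:
    """Hands each stateless request to the next group in turn."""

    def __init__(self, groups):
        if not groups:
            raise ValueError("need at least one complete group")
        self._groups = cycle(groups)

    def dispatch(self, request_id):
        # No shared state between requests, so any group can serve any request.
        return request_id, next(self._groups)


# Hypothetical fleet: 10 servers, 4 needed per request, so 2 groups and 2 idle.
servers = [f"gpu-{i}" for i in range(10)]
groups = make_groups(servers, 4)
dispatcher = RoundRobinDispatcher(groups)
```

Because the requests are stateless, adding capacity in this sketch just means appending more complete groups to the list.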

It's very doable: any serious SRE can find a way to set up "larger than one card" models like Kimi or DeepSeek (unquantized) if they have a tightly coupled HPC cluster (or a pair of very, very beefy servers).

If you run out of servers, that is again a money problem, not an architectural problem (and modern datacenters are already scalable).

Take the best SRE, but no budget, and there is no solution.

So inference is the easy part.

With Codex or Claude Code, taking a lot of time or having slow cold-start latency is considered very acceptable.

Some users would probably not even see the difference if a request takes 2 minutes versus 3 minutes.

The real difficult part is to have context caching and external tools, because now you are depending on services that might be lagging.

  Executing code, browsing the web, all of that is tricky to scale because these are very unreliable (they tend to time out, require a large cache of web pages, need captchas circumvented, etc.).

These are traditional scaling problems, but they are more difficult because all these pieces are fragile and queues can snowball easily.

BoneShard · 2026-04-29T04:02:31 1777435351

Yeah, and you totally missed the RAI part, billing, model deployment, security patches, rate-limiting, caching, dead GPUs, metrics, multiple regions, gov clouds, GDPR (or data locality issues), monitoring, alerting, and god knows what else, all at extreme loads.

rvnx · 2026-04-29T04:44:35 1777437875

GDPR doesn’t affect load, dead GPUs are no different from any software freeze, a model deployment is a file update, metrics already scale very well (even at far larger volumes) and very linearly, and security updates are hedged with gradual rollouts, canaries, feature flags, etc.

From an ops perspective all of these things are already really well solved issues in a very scalable manner, because plenty of companies had to solve these issues before.

It’s even better here because you can throw millions in salaries to “steal” the insider info on how their production actually works.

No doubt it is fast-paced, but the complexity of going from 100k GPUs to 1M is much lower than that of going from 1k to 10k GPUs.

All 3 big AI companies had the luxury that during the scaling phase they could do everything directly on production servers.

This is because customers were very very tolerant, and are still quite tolerant.

You can even set request limits for large users and shape the traffic.
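One common way to limit requests per user is a token bucket. The sketch below is only an illustration of that general technique, with a hypothetical `TokenBucket` class and made-up rates; nothing here is known to match what any LLM provider actually runs.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request fits, else False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Hypothetical per-user limit: bursts of 5 requests, 1 request per second sustained.
bucket = TokenBucket(rate_per_sec=1.0, capacity=5)
```

A rejected request can be queued or answered with a retry-later response, which is the "shaping" part: large users get smoothed out instead of saturating the fleet.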

Compare that with Cloudflare: high scale, low latency, end users not tolerant at all of downtime, customers even less tolerant, clearly hostile actors that actively try to take your systems down, a limited budget, a lot of different workloads, etc.

So LLM companies are very lucky: they have to scale a single workload, coming mostly from free users, most paid customers can be throttled without anyone complaining because nobody knows what the limits are, and there is a lot of tolerance for high latency and even downtime.

CSSer · 2026-04-28T18:40:22 1777401622

Can you speak a little more to this? I'm curious what kind of parameters one must consider/monitor and what kind of novel things could go wrong.

aleksiy123 · 2026-04-28T19:12:07 1777403527

My guesses are:

Hardware capacity constraints are going to be the big one.

Effective caching is another. I bet if you start hitting cold caches the whole thing's going to degrade rapidly.

The ground is probably shifting pretty rapidly.

Power users are trying to get the most out of their subscriptions and so are hammering you as fast as they possibly can. See Ralph loops.

Harnesses are evolving pretty rapidly, and new alternative harnesses keep appearing. That makes the load patterns less predictable and harder to cache.

The demand is increasing both from more customers, but also from each user as they figure out more effective workflows.

Users are pretty sensitive to model quality changes. You probably want smart routing, but users want the best model all the time.

Models keep getting bigger and bigger.

On top of that, they are probably hiring and onboarding more people, and system and codebase complexity is growing.

Yhippa · 2026-04-28T21:10:58 1777410658

Just ask Claude and some agents to fix it...

wrs · 2026-04-28T18:44:46 1777401886

On the other hand, the status page is blaming the authentication system, which one would think is not a frontier-class problem.

Havoc · 2026-04-28T23:48:26 1777420106

Would have thought that compared to training the serving part is pretty easy. Less of an “everything needs to come together at once” and more just move demand to a working cluster if one bombs & have some spare capacity
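A rough sketch of "move demand to a working cluster", assuming each cluster reports a health flag and a fixed capacity. The `Cluster` fields and the `route` function are hypothetical names for illustration, and the policy (pick the healthy cluster with the most spare capacity) is just one simple choice.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    healthy: bool
    capacity: int        # max concurrent requests (assumed known)
    in_flight: int = 0   # requests currently being served

    @property
    def spare(self):
        return self.capacity - self.in_flight


def route(clusters):
    """Pick the healthy cluster with the most spare capacity, or None if all are full or down."""
    candidates = [c for c in clusters if c.healthy and c.spare > 0]
    if not candidates:
        return None
    target = max(candidates, key=lambda c: c.spare)
    target.in_flight += 1
    return target
```

When `route` returns `None`, the caller has to queue or shed load, which is where the spare-capacity budget matters.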

hulitu · 2026-05-05T12:45:38 1777985138

You mean the AI sleeps?

beernet · 2026-04-28T16:49:04 1777394944

Agreed. Just shows that big money doesn't dilute small character.

beernet · 2026-04-28T14:47:39 1777387659

Obviously they are. And it is solely due to Sam and his unstoppable desire for influence, which is pathetic. He really fumbled the top position in arguably the most important race ever. Pretty incredible.

kybb4 · 2026-04-28T15:38:28 1777390708

It's how the world works. See The Theory of the Leisure Class. Or how Microsoft survived Bill Gates.

beernet · 2026-04-28T11:07:25 1777374445

Link is behind a paywall. In any case, I do not think you can evaluate any company for "AI agent readiness" (what even is that?) without having detailed insights into the internal systems and processes of the company.

beernet · 2026-04-28T11:05:10 1777374310

What is "sovereign infra" exactly?

mathgeek · 2026-04-28T11:22:31 1777375351

I know it's just marketing speak, but the term made me think of the scenes in the Matrix where what's left of humanity (ignoring all the cyclical lore that was added on top of it) has to make sure the machines can't remote in to any of their tech.

tfrancisl · 2026-04-28T11:09:33 1777374573

No less than self-hosted, imo. If you're on some cloud, it doesn't really matter that you pay them absurd amounts of money; you aren't sovereign.

beernet · 2026-04-28T13:27:50 1777382870

So if a company self hosts their physical infrastructure which will burn down once a fire sets in, they are more "sovereign" than a company running on a redundant cloud? I definitely would not want to be "sovereign" then.

Point is: This discussion is much more multi-dimensional than some suggest.

tfrancisl · 2026-04-28T20:02:54 1777406574

A redundant cloud that could be rug pulled from you any day if the platform decides you are in violation of their terms, or if they just don't like your project. Yes, on prem is more sovereign than that. That doesn't mean it doesn't have drawbacks, and no one said it didn't. But if sovereignty is more important than redundancy, then on prem is certainly an option.

embedding-shape · 2026-04-28T11:22:59 1777375379

So literally a computer at home/in the office, as with anything else you don't really "own" the infrastructure? Or is this just about "cloud"?

icy · 2026-04-28T11:37:19 1777376239

Yeah sorry it's marketing BS speak for self-hosted or just infra that you control. It could be a VPS, it could be a Raspberry Pi at home. Your repos live on your servers. (And we support this on Tangled today!)

embedding-shape · 2026-04-28T11:38:39 1777376319

> just infra that you control

But a VPS isn't actually infrastructure you control, you essentially have as much control over it as "cloud", so I don't think that'd be counted as "sovereign", would it?

icy · 2026-04-28T11:46:38 1777376798

Perhaps, but it's still better than nothing!

beernet · 2026-04-16T14:43:44 1776350624

They obviously collaborate with some of the labs prior to the official release date.

sigbottle · 2026-04-16T14:46:35 1776350795

That... is a more plausible explanation I didn't think of.

danielhanchen · 2026-04-16T15:07:24 1776352044

Yes we collab with them!

qskousen · 2026-04-16T18:13:04 1776363184

Sorry this is a bit of a tangent, but I noticed you also released UD quants of ERNIE-Image the same day it released, which as I understand requires generating a bunch of images. I've been working to do something similar with my CLI program ggufy, and was curious if you had any info you could share on the kind of compute you put into that, and if you generate full images or look at latents?

danielhanchen · 2026-04-17T08:34:41 1776414881

Yes we have started doing diffusion GGUFs but it's in its infancy :) But yes we do generate images to test quants out!