> Time-to-First-Token of approximately 19 seconds for a gemma3:4b model (this includes startup time, model loading time, and running the inference)
This is my biggest pet peeve with serverless GPU. 19 seconds is horrible latency from the user’s perspective, and that’s the best-case scenario.
If this is the best one of the most experienced teams in the world can do, with a small 4B model, then it feels like serverless is really restricted to non-interactive use cases.
That has to be the cold start, and the next N requests would surely reuse the already-running instance? It sounds bananas that they’d even mention something with 19 seconds of latency on every request in any context.
That's true. Traditional single-tier storage cannot meet the throughput and latency demands. My cofounder wrote this piece on a three-tiered storage architecture optimized for both performance and cost - https://nilesh-agarwal.com/three-tier-storage-architecture-f...
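For readers who don't follow the link, here is a minimal sketch of the tiered-read idea. The specific tiers (in-process memory, local disk cache, remote object storage), the paths, and the helper names are my assumptions for illustration, not the linked design:

```python
import os

# Tier 1: in-process cache for weights that are already loaded.
_memory_cache: dict[str, bytes] = {}

def fetch_from_object_storage(name: str) -> bytes:
    # Placeholder for the slow path (e.g. a GCS/S3 download); wire this to your store.
    raise NotImplementedError

def read_weights(name: str, local_dir: str = "/var/cache/models") -> bytes:
    # Tier 1: already resident in memory (fastest, smallest capacity).
    if name in _memory_cache:
        return _memory_cache[name]
    # Tier 2: local disk cache on the node (fast, survives process restarts).
    local_path = os.path.join(local_dir, name)
    if os.path.exists(local_path):
        with open(local_path, "rb") as f:
            data = f.read()
    else:
        # Tier 3: remote object storage (cheapest, slowest); backfill tier 2.
        data = fetch_from_object_storage(name)
        os.makedirs(local_dir, exist_ok=True)
        with open(local_path, "wb") as f:
            f.write(data)
    _memory_cache[name] = data
    return data
```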
Sure, but how often is an enterprise-deployed LLM application really cold-starting? While you could run this for one-off and personal use, this is probably more geared towards bursty ‘here’s an agent for my company sales reps’ kinds of workloads, so you can keep an instance warm, then autoscale up at 8:03am when everyone gets online (or into the office or whatever).
At that point, 19 seconds looks great, as lower-latency startup times allow for much more efficient autoscaling.
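As a rough sketch of that setup, assuming a Cloud Run service (the service name, region, and numbers here are hypothetical), keeping a warm instance is just the standard scaling flags:

```sh
# Keep one instance warm so interactive users never hit the 0 -> 1 cold start,
# and cap burst capacity; --concurrency is parallel requests per instance.
gcloud run services update gemma-inference \
  --region=us-central1 \
  --min-instances=1 \
  --max-instances=10 \
  --concurrency=4
```

With a minimum of one instance the 0 -> 1 cold start disappears for interactive traffic; the trade-off is paying for the idle instance instead of eating the 19-second hit.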
Definitely -- and yet it's kind of a feat compared to other solutions: when I tried Runpod Serverless I could wait up to five minutes for a cold start with an even smaller model than a 4B.
If you were running a real business with these, would the aim not be to overprovision and to set up autoscaling in such a way that you always have excess capacity?
That seems to be the gist of it. You cannot rely on serverless alone; you need one or more pre-warmed instances at all times. This caveat is rarely mentioned in serverless GPU discussions, yet it matches my experience in general.
When scaling from 0 to 1 instances, yes, you have to wait 19 seconds.
For scaling N --> N+1: if you configure the correct concurrency value (the number of parallel requests one instance can handle), Cloud Run will spin up additional instances when utilization reaches X% (I think it's 70%), i.e. before an instance is fully saturated.
So your users should not experience the 19-second cold start.
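Here is the back-of-the-envelope version of that scale-up rule, as a sketch rather than Cloud Run's actual autoscaler (the 70% target and concurrency of 4 are illustrative values taken from the comment above):

```python
import math

def desired_instances(in_flight: int,
                      concurrency: int = 4,
                      target_utilization: float = 0.70) -> int:
    # Scale so each instance sits at ~70% of its concurrency limit,
    # so new instances are requested before any single one is saturated.
    effective_capacity = concurrency * target_utilization
    return max(1, math.ceil(in_flight / effective_capacity))

for load in (1, 3, 6, 12):
    print(f"{load} in-flight -> {desired_instances(load)} instance(s)")
```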