> Time-to-First-Token of approximately 19 seconds for a gemma3:4b model (this includes startup time, model loading time, and running the inference)
This is my biggest pet peeve with serverless GPU. 19 seconds is horrible latency from the user’s perspective, and that’s the best-case scenario.
If this is the best one of the most experienced teams in the world can do, with a small 4B model, then it feels like serverless is really restricted to non-interactive use cases.
That has to be the cold start, and the next N requests would surely reuse the already-running instance? It sounds bananas that they’d even mention something with 19 seconds of latency on every request in any context.
That's true. Traditional single-tier storage cannot meet the throughput and latency demands. My cofounder wrote this piece on a three-tiered storage architecture optimized for both performance and cost - https://nilesh-agarwal.com/three-tier-storage-architecture-f...
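For readers who don't follow the link, here is a minimal sketch of the tiered-read idea. The specific tiers (in-process memory, local disk cache, remote object storage), the paths, and the helper names are my assumptions for illustration, not the linked design:

```python
import os

# Tier 1: in-process cache for weights that are already loaded.
_memory_cache: dict[str, bytes] = {}

def fetch_from_object_storage(name: str) -> bytes:
    # Placeholder for the slow path (e.g. a GCS/S3 download); wire this to your store.
    raise NotImplementedError

def read_weights(name: str, local_dir: str = "/var/cache/models") -> bytes:
    # Tier 1: already resident in memory (fastest, smallest capacity).
    if name in _memory_cache:
        return _memory_cache[name]
    # Tier 2: local disk cache on the node (fast, survives process restarts).
    local_path = os.path.join(local_dir, name)
    if os.path.exists(local_path):
        with open(local_path, "rb") as f:
            data = f.read()
    else:
        # Tier 3: remote object storage (cheapest, slowest); backfill tier 2.
        data = fetch_from_object_storage(name)
        os.makedirs(local_dir, exist_ok=True)
        with open(local_path, "wb") as f:
            f.write(data)
    _memory_cache[name] = data
    return data
```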
Sure, but how often is an enterprise-deployed LLM application really cold-starting? While you could run this for one-off and personal use, this is probably more geared towards bursty ‘here’s an agent for my company sales reps’ kinds of workloads, so you can keep an instance warm, then autoscale up at 8:03am when everyone gets online (or into the office or whatever).
At that point, 19 seconds looks great, as lower-latency startup times allow for much more efficient autoscaling.
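As a rough sketch of that setup, assuming a Cloud Run service (the service name, region, and numbers here are hypothetical), keeping a warm instance is just the standard scaling flags:

```sh
# Keep one instance warm so interactive users never hit the 0 -> 1 cold start,
# and cap burst capacity; --concurrency is parallel requests per instance.
gcloud run services update gemma-inference \
  --region=us-central1 \
  --min-instances=1 \
  --max-instances=10 \
  --concurrency=4
```

With a minimum of one instance the 0 -> 1 cold start disappears for interactive traffic; the trade-off is paying for the idle instance instead of eating the 19-second hit.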
Definitely -- and yet it's kind of a feat compared to other solutions: when I tried Runpod Serverless I could wait up to five minutes for a cold start with an even smaller model than a 4B.
If you were running a real business with these, would the aim not be to overprovision and to set up autoscaling in such a way that you always have excess capacity?
That seems to be the gist of it. You cannot rely on serverless alone; you need one or more pre-warmed instances at all times. This caveat is rarely mentioned in serverless GPU discussions, yet it matches my experience in general.
When scaling from 0 to 1 instances, yes, you have to wait 19 seconds.
For scaling N --> N+1: if you configure the correct concurrency value (the number of parallel requests one instance can handle), Cloud Run will spin up additional instances when utilization reaches X% (I think it's 70%), i.e. before an instance is fully saturated.
So your users should not experience the 19-second cold start.
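Here is the back-of-the-envelope version of that scale-up rule, as a sketch rather than Cloud Run's actual autoscaler (the 70% target and concurrency of 4 are illustrative values taken from the comment above):

```python
import math

def desired_instances(in_flight: int,
                      concurrency: int = 4,
                      target_utilization: float = 0.70) -> int:
    # Scale so each instance sits at ~70% of its concurrency limit,
    # so new instances are requested before any single one is saturated.
    effective_capacity = concurrency * target_utilization
    return max(1, math.ceil(in_flight / effective_capacity))

for load in (1, 3, 6, 12):
    print(f"{load} in-flight -> {desired_instances(load)} instance(s)")
```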