> But inference latency just doesn’t seem to matter yet, so the market doesn’t care.
This is a very strange statement to make. They are acting like inference today happens with freshly spun up VMs and model access over remote networks (and their local switching could save the day). It’s actually hitting clusters of hot machines with the model of choice already loaded into VRAM.
In real deployments, latency can be small (if implemented well), and speed comes down to having the right GPU config for the model (which Fly doesn't offer).
People have built better shared-resource inference systems for LoRAs (OpenAI, Fireworks, LoRAX), but they're not built on VMs. They're model-aware: the right hardware for the base model, plus optimized caching/swapping of the LoRAs.
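To make "model-aware" concrete, here's a minimal sketch of that pattern (the names and cache size are mine, not any of these vendors' actual code): the base model stays resident in VRAM, and only the small adapters get cached and swapped per request.

```python
# Hypothetical sketch of model-aware multi-LoRA serving: base weights stay hot
# in VRAM; only small adapters are cached/swapped per request.
from collections import OrderedDict

ADAPTER_CACHE_SIZE = 8  # how many LoRAs we keep hot alongside the base model

def load_base_model():
    # Stand-in for the expensive part: 20GB+ of base weights, loaded once per GPU.
    return "base-model-weights-resident-in-vram"

def fetch_adapter(adapter_id):
    # Stand-in for pulling a ~10-100MB LoRA from local disk or object storage.
    return f"lora-weights:{adapter_id}"

base_model = load_base_model()   # done at startup, never per request
adapters = OrderedDict()         # LRU cache of hot adapters

def run_inference(adapter_id, prompt):
    if adapter_id in adapters:
        adapters.move_to_end(adapter_id)              # cache hit: no weight movement
    else:
        if len(adapters) >= ADAPTER_CACHE_SIZE:
            adapters.popitem(last=False)              # evict least-recently-used LoRA
        adapters[adapter_id] = fetch_adapter(adapter_id)  # cheap vs. base weights
    return f"generate({base_model} + {adapters[adapter_id]}, {prompt!r})"

print(run_inference("customer-42", "Hello"))
```

The point is the cost asymmetry: swapping an adapter measured in tens of MB is cheap, re-loading the base model is not, so the scheduler only has to be clever about the adapters.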
I'm not sure the Fly/VM way will ever be the path for ML. Their VM cold start time doesn't matter if app startup requires loading 20GB+ of weights.
Companies like Fireworks are working on fast LoRA inference cold starts. Companies like Modal are working on fast serverless VM cold starts with a range of GPU configs (2x H100, A100, etc.). These seem more like the two cloud primitives for AI.
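For the second primitive, the developer-facing shape is roughly a decorated function that requests a GPU config, loosely in the style of Modal (the exact decorator arguments below are my assumption from memory, not verified against their docs):

```python
# Rough sketch of the "serverless GPU function" primitive, modeled loosely on
# Modal's decorator style; treat the parameter values as illustrative.
import modal

app = modal.App("llm-inference")

@app.function(gpu="A100")  # ask the platform for a specific GPU config
def generate(prompt: str) -> str:
    # In a real deployment the 20GB+ of weights would be loaded/cached here,
    # which is exactly the cold-start cost that a fast VM boot doesn't cover.
    return f"completion for {prompt!r}"
```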
I think what they mean about latency not mattering is that latency to the LLM provider doesn't matter. So why run it yourself when there are APIs you can hit that provide a better overall experience (and whose cost seems to be dropping ~90% year over year)?