
Because it’s a waste for anything other than a proof of concept or a handful of users.

It’s really simple to take some Python inference code and wrap FastAPI around it.
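
Something like this toy sketch is all it takes (the TorchScript checkpoint name and input shape here are made up for illustration):

    import torch
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    # Model is loaded once at process startup and held in memory for the
    # life of the worker - fine for a demo, limiting at scale.
    model = torch.jit.load("model.pt")  # hypothetical TorchScript checkpoint
    model.eval()

    class PredictRequest(BaseModel):
        features: list[float]

    @app.post("/predict")
    def predict(req: PredictRequest):
        with torch.no_grad():
            x = torch.tensor(req.features).unsqueeze(0)
            y = model(x)
        return {"prediction": y.squeeze(0).tolist()}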

However, inference servers exist for a reason. You’ll quickly find that performance, VRAM usage, model management, and so on aren’t practical with the FastAPI approach.

Speaking personally, inference servers like Nvidia Triton bring a model to performance levels that are absolutely night and day versus the FastAPI approach - in many cases orders of magnitude better response times and requests per second.



Can you list the concrete problems a FastAPI approach will have, and what tools like Nvidia Triton do differently to get around them? I have no idea about running such models at scale.


Not GP, but what Nvidia Triton can do includes:

- Dynamic batching while limiting latency to a set threshold

- Running multiple instances of a model, effectively load-balancing inference requests.

- Loading/unloading/running multiple versions of models dynamically, which is useful if you want to update (or roll back) your model without interfering with existing inference requests (config sketch below).
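
For reference, all three of those are driven by the model’s config.pbtxt in the Triton model repository - roughly along these lines (model name, batch sizes, and latency budget are illustrative):

    # config.pbtxt (values illustrative)
    name: "my_model"
    platform: "onnxruntime_onnx"
    max_batch_size: 32

    # Batch requests from concurrent clients, waiting at most ~2 ms extra.
    dynamic_batching {
      preferred_batch_size: [ 8, 16, 32 ]
      max_queue_delay_microseconds: 2000
    }

    # Run two copies of the model on GPU 0 to load-balance requests.
    instance_group [
      { count: 2, kind: KIND_GPU, gpus: [ 0 ] }
    ]

    # Keep the two newest versions loaded so a rollout or rollback doesn't
    # interrupt in-flight requests.
    version_policy: { latest { num_versions: 2 } }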

Its client library provides async inference APIs, so you can easily put a FastAPI-based API server in front of it and don’t necessarily need a queue (like Celery).
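
As a rough sketch of that setup, assuming the tritonclient package’s async HTTP client (the model name, tensor names, and shapes are placeholders for whatever the deployed model actually declares):

    import numpy as np
    import tritonclient.http.aio as triton_http  # pip install tritonclient[http]
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    client = triton_http.InferenceServerClient(url="localhost:8000")

    class PredictRequest(BaseModel):
        features: list[float]

    @app.post("/predict")
    async def predict(req: PredictRequest):
        # "INPUT0"/"OUTPUT0" stand in for the model's declared tensor names.
        x = np.asarray([req.features], dtype=np.float32)
        inp = triton_http.InferInput("INPUT0", list(x.shape), "FP32")
        inp.set_data_from_numpy(x)
        out = triton_http.InferRequestedOutput("OUTPUT0")

        # Non-blocking call into Triton; batching happens server-side.
        result = await client.infer("my_model", inputs=[inp], outputs=[out])
        return {"prediction": result.as_numpy("OUTPUT0").tolist()}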


Sure!

FastAPI loads a model statically on startup. There are hacks to reload versions and new models via load balancers and the like, but they’re just that - hacks. There are also known issues, with TensorFlow especially, around poor memory management as request counts grow.

FastAPI is great but at the end of the day it’s Python and the performance reflects that (more on this later).

With Nvidia Triton you get:

- Automatic support for various model frameworks/formats: native PyTorch/TensorFlow, ONNX, and more.

- Dynamic batching. You can configure an SLA with a maximum added latency, and Triton will queue requests from multiple clients over that window and pass them through the model as a single batch. If you have the VRAM (you should), it’s an instant performance multiplier.

- Even better performance: Triton can do things like automatically compile/convert a model to TensorRT on the runtime hardware. This lets you deploy models across hardware families with optimized performance without worrying about the specific compute architecture or dealing with TensorRT yourself.

- Optimized and efficient use of multiple GPUs.

- Model version management. Triton has a model management API you can use to upload a new model/version and load it dynamically. It can hot load/reload a model and serve it instantly, with configuration options for always serving the latest model or allowing clients to request a specific version (see the sketch after this list).

- Performance metrics. It has a built-in Prometheus metrics endpoint.

- Other tools like Model Navigator and Performance Analyzer. You can pass a model to these tools and they will try every viable model format, batch size, and so on against an actual Triton server, then produce a report and an optimized model configuration based on your chosen targets - requests per second, response time, even memory/compute utilization, power usage, and more.

- Out of the box, without any of these tricks, Triton is faster and uses less memory, less GPU compute, and less CPU compute. It’s written in C++ and optimized by Nvidia.

- It’s a single implementation (often a container) that from the get-go is smaller, lighter weight, and easier to manage than pip installing a bunch of dependencies and the entire runtime framework itself. It exists solely to serve models and serve them well.
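
To make the version-management piece concrete, here’s a rough sketch with the synchronous tritonclient HTTP client (the model name is made up, and this assumes the server was started with --model-control-mode=explicit so the repository API is allowed to load/unload models at runtime):

    import tritonclient.http as triton_http

    client = triton_http.InferenceServerClient(url="localhost:8000")

    # Hot-load the newest version dropped into the model repository.
    client.load_model("my_model")
    print(client.get_model_repository_index())  # models, versions, and state
    print(client.is_model_ready("my_model"))

    # Per-request version pinning, if the model's version policy allows it:
    # result = client.infer("my_model", inputs, model_version="2")

    client.unload_model("my_model")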

When you add it all up (as I mentioned), I’ve personally seen cases where requests per second increase by orders of magnitude, with lower response times than a single request against FastAPI (or similar). Plus all of the MLOps and metrics features.

Frankly, it’s pretty amazing.



