I feel like these guys are missing a pretty important point in their own analysis. I tried setting up an ollama LLM on a fly.io GPU machine and it was near impossible because of fly.io limitations such as:
1. Their infrastructure doesn't support streaming responses well at all (which is an important part of the LLM experience in my view).
2. The LLM itself is massive and can't be part of the Docker image I was building and uploading. Fly doesn't have a nice way around this, so I had to set up a whole heap of code to pull it in on the Fly machine's first invocation, which doesn't work well once you start running multiple machines. It was messy and ended up in a long support ticket with them that didn't get it working any better, so I gave up.
What kind of issues did you have with streaming? I also set up ollama on fly.io, and had no issues getting streaming to work.
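For what it's worth, streaming from ollama is just newline-delimited JSON over HTTP, so a minimal consumer looks roughly like this (a sketch only; the model name is a placeholder and localhost:11434 is ollama's default address):

```python
import json
import requests

# /api/generate streams newline-delimited JSON chunks by default.
# "llama3" is a placeholder; use whichever model you've pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?"},
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        break
```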
For the LLM itself, I just used a custom startup script that downloaded the model once ollama was up. It's the same thing I'd do on a local cluster though. I'm not sure how fly could make it better unless they offered direct integration with ollama or some other inference server?
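A rough sketch of that startup-script idea (in Python for illustration; the port is ollama's default and the model name is a placeholder, and it assumes the model directory sits on persistent storage so later boots skip the download):

```python
import subprocess
import time

import requests

OLLAMA_URL = "http://localhost:11434"  # ollama's default port
MODEL = "llama3"                       # placeholder; use whatever model you serve

# Block until the ollama server is answering requests.
while True:
    try:
        requests.get(OLLAMA_URL, timeout=2)
        break
    except requests.RequestException:
        time.sleep(1)

# Pull the model through the running server; the pull is effectively a
# no-op if the model is already present on disk.
subprocess.run(["ollama", "pull", MODEL], check=True)
```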
I mean, yes? Managing giant model weight files is a big problem with getting people on-demand access to Docker-based micro-VMs. I don't think we missed that point so much as that we acknowledged it, and found some clarity in the idea that we weren't going to break up our existing DX just to fix it. If there were lots and lots and lots of people trying to self-host LLMs running into this problem, it would have been a harder call.