Which cloud is even remotely worth it over buying 20x RTX 3090s, or even some Quadros, for training?
Maybe if you have a very small team and small problems. But if you have CV/video tasks and a team of more than 3 (maybe even 2) people, in-house servers are always the better choice: you'll get your money back within 2-3 months of training compared to a cloud solution, and maybe even more if you wait for the RTX 4090.
And if you're a solo dev it's an even easier choice, since you can reuse your rig for other things when you aren't training anything (gaming, for example :D).
The only exception is if you get $100k free from AWS and then $100k from GCP; you can live on that for a year, or even two if you stack both providers. But that's a special case, and I'm not sure how easy it is to get $100k right now.
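The payback claim above can be sketched as a back-of-envelope calculation. All prices here are hypothetical placeholders (a rough 2022-era RTX 3090 street price and an assumed per-GPU cloud rate), not quotes from any provider:

```python
# Back-of-envelope payback estimate for buying vs renting GPUs.
# All numbers are hypothetical assumptions, not real provider pricing.
RIG_COST = 20 * 1500          # 20x RTX 3090 at an assumed ~$1,500 each
CLOUD_RATE = 1.10             # assumed $/hour for one comparable cloud GPU
HOURS_PER_MONTH = 730         # average hours in a month

def payback_months(n_gpus: int, utilization: float) -> float:
    """Months until buying beats renting, at a given average utilization."""
    monthly_cloud_cost = n_gpus * CLOUD_RATE * HOURS_PER_MONTH * utilization
    return RIG_COST / monthly_cloud_cost

# At full utilization the rig pays for itself in under two months;
# at 50% utilization it takes roughly twice as long.
print(round(payback_months(20, 1.0), 1))
print(round(payback_months(20, 0.5), 1))
```

Under these assumed prices the 2-3 month figure only holds if the rig is kept busy most of the time, which is exactly the utilization question the replies below argue about.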
As mentioned in the comment, ML training workloads tend to be super chunky (at least in my experience). Some days we want to train 50 models; some weeks we're evaluating and don't need any compute.
I'd rather be able to spin up 200 GPUs in parallel when needed (yes, at a premium), but ramp to 0 when not. Data scientists waiting around are more expensive than GPUs. Replacing and maintaining servers is more work and money than you expect. And for us the training data was cloud-native, so transfer/privacy/security is easier: nothing on-prem, data scientists can design models without having access to the raw data, etc.
If you're a cloud-only company then sure, it's just easier, but it still won't be cheaper, just more convenient to use.
If the data science team is very big, the "best" solution without unlimited money is probably to run locally and burst to the cloud [at a premium] when you don't have free resources for your teams. (That was the setup when I worked at a pretty big EU bank, though it wasn't "true" deep learning yet [about 4-5 years ago].)
You have a good point. I think for small enough workloads, self-managing instances on-prem is more cost-effective. There is a simplicity gain in being able to scale instances up and down in the cloud, but it may not make sense if you can self-manage without too much work.
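The trade-off both sides are circling can be framed as a break-even utilization: how busy a GPU has to be before owning beats renting. A minimal sketch, assuming a $1,500 card amortized over 24 months against a hypothetical $1.10/hour cloud rate (both numbers are illustrative, not real pricing):

```python
# Break-even utilization for owning vs renting one comparable GPU.
# All prices are hypothetical assumptions for illustration.
HOURS_PER_MONTH = 730  # average hours in a month

def breakeven_utilization(onprem_monthly: float, cloud_hourly: float) -> float:
    """Fraction of the month a GPU must be busy before owning is cheaper.

    Cloud bills only for busy hours; on-prem costs the same either way.
    """
    return onprem_monthly / (cloud_hourly * HOURS_PER_MONTH)

# A $1,500 GPU amortized over 24 months is ~$62.50/month of hardware cost.
frac = breakeven_utilization(onprem_monthly=1500 / 24, cloud_hourly=1.10)
print(f"{frac:.0%}")  # owning wins once the card is busy more than this
```

Under these assumptions the break-even is low (under 10% utilization), which is why the chunky-workload argument matters: it's only when a team genuinely ramps to zero for long stretches, or values burst capacity, that renting comes out ahead. Note this ignores on-prem power, maintenance, and admin time, which push the break-even up.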
You are years behind if you think you're training a model worth anything on consumer-grade GPUs. Table stakes these days is 8x A100 pods, and lots of them. Luckily you can just get DGX pods so you don't have to build racks, but for many orgs simply renting the pods is much cheaper.
Ah yes, because there is only one way to do deep learning, and it is of course stacking models large enough to be useless outside pods of GPUs. That's surely the way to go if you want to make money (from VCs, of course, because you won't have many users willing to pay enough for you to ever break even, as with OpenAI and the other big model providers; maybe you can get some money/sponsorship from a state or a university).
The market for small, efficient models running locally on device is pretty big, maybe even the biggest that exists right now [iOS, Android, and macOS are pretty easy to monetize with low-cost models that are useful].
I can assure you of that, and you can do it on even 4x RTX 3090 [it won't be fast, but you'll get there :)].
Years behind what? Table stakes for what? There is much more to ML than the latest transformer and diffusion models. While those get the attention, the research outside that space dominates in volume.
There is tons of value to be had from smaller models. Even some state-of-the-art results can be obtained on a relatively small set of commodity GPUs. Not everything is GPT-scale.
Isn't a key selling point of the latest, hottest model, the one on the front page of Hacker News multiple times right now, the fact that it fits on consumer-grade GPUs? Surely some of the interesting ideas it's spawning right now are people doing transfer learning on GPUs whose names don't end in "100", don't you think?
You know there's a huge difference between training the original model and transfer learning to apply it to a new use case, right? Saying people are years behind if they think their work is only worth something with 8x A100 pods is pretty ignorant of how most applications get built. Not everyone's trying to design novel model architectures, nor should they.