Which cloud is even remotely worth it over buying 20x RTX 3090s, or even some Quadros, for training?
Maybe if you have a very small team and small problems. But if you have CV/video tasks and a team of more than 3 (maybe even 2) people, in-house servers are always the better choice: you'll get your money back within 2-3 months of training compared to a cloud solution, and maybe even more if you wait for the RTX 4090.
And if you're a solo dev it's an even easier choice, since you can reuse your rig for other things when you aren't training anything (gaming, for example :D).
The only exception is if you get $100k free from AWS and then $100k from GCP; you can live on that for a year, or even two if you stack both providers. But that's a special case, and I'm not sure how easy it is to get $100k right now.
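The payback claim above can be sketched as a back-of-envelope calculation. All prices here are hypothetical placeholders (a rough 2022-era RTX 3090 street price and an assumed per-GPU cloud rate), not quotes from any provider:

```python
# Back-of-envelope payback estimate for buying vs renting GPUs.
# All numbers are hypothetical assumptions, not real provider pricing.
RIG_COST = 20 * 1500          # 20x RTX 3090 at an assumed ~$1,500 each
CLOUD_RATE = 1.10             # assumed $/hour for one comparable cloud GPU
HOURS_PER_MONTH = 730         # average hours in a month

def payback_months(n_gpus: int, utilization: float) -> float:
    """Months until buying beats renting, at a given average utilization."""
    monthly_cloud_cost = n_gpus * CLOUD_RATE * HOURS_PER_MONTH * utilization
    return RIG_COST / monthly_cloud_cost

# At full utilization the rig pays for itself in under two months;
# at 50% utilization it takes roughly twice as long.
print(round(payback_months(20, 1.0), 1))
print(round(payback_months(20, 0.5), 1))
```

Under these assumed prices the 2-3 month figure only holds if the rig is kept busy most of the time, which is exactly the utilization question the replies below argue about.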
As mentioned in the comment, ML training workloads tend to be super chunky (at least in my experience). Some days we want to train 50 models; some weeks we're evaluating and don't need any compute.
I'd rather be able to spin up 200 GPUs in parallel when needed (yes, at a premium), but ramp to 0 when not. Data scientists waiting around are more expensive than GPUs. Replacing and maintaining servers is more work and money than you expect. And for us the training data was cloud-native, so transfer/privacy/security is easier: nothing on-prem, data scientists can design models without having access to the raw data, etc.
If you're a cloud-only company then sure, it's just easier, but it still won't be cheaper, just more convenient to use.
If the data science team is very big, the "best" solution without unlimited money is probably to run locally and burst to the cloud [at a premium] when you don't have free resources for your teams. (That was the setup when I worked at a pretty big EU bank, though it wasn't "true" deep learning yet [about 4-5 years ago].)
You have a good point. I think for small enough workloads, self-managing instances on-prem is more cost-effective. There is a simplicity gain in being able to scale instances up and down in the cloud, but it may not make sense if you can self-manage without too much work.
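The trade-off both sides are circling can be framed as a break-even utilization: how busy a GPU has to be before owning beats renting. A minimal sketch, assuming a $1,500 card amortized over 24 months against a hypothetical $1.10/hour cloud rate (both numbers are illustrative, not real pricing):

```python
# Break-even utilization for owning vs renting one comparable GPU.
# All prices are hypothetical assumptions for illustration.
HOURS_PER_MONTH = 730  # average hours in a month

def breakeven_utilization(onprem_monthly: float, cloud_hourly: float) -> float:
    """Fraction of the month a GPU must be busy before owning is cheaper.

    Cloud bills only for busy hours; on-prem costs the same either way.
    """
    return onprem_monthly / (cloud_hourly * HOURS_PER_MONTH)

# A $1,500 GPU amortized over 24 months is ~$62.50/month of hardware cost.
frac = breakeven_utilization(onprem_monthly=1500 / 24, cloud_hourly=1.10)
print(f"{frac:.0%}")  # owning wins once the card is busy more than this
```

Under these assumptions the break-even is low (under 10% utilization), which is why the chunky-workload argument matters: it's only when a team genuinely ramps to zero for long stretches, or values burst capacity, that renting comes out ahead. Note this ignores on-prem power, maintenance, and admin time, which push the break-even up.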
You are years behind if you think you're training a model worth anything on consumer-grade GPUs. Table stakes these days is 8x A100 pods, and lots of them. Luckily you can just get DGX pods so you don't have to build racks, but for many orgs simply renting the pods is much cheaper.
Ah yes, because there is only one way to do deep learning, and it is of course stacking models large enough to be useless outside pods of GPUs. That's surely the way to go if you want to make money (from VCs, of course, because you won't have many users willing to pay enough for you to ever break even, as with OpenAI and the other big model providers; maybe you can get some money/sponsorship from a state or a university).
The market for small, efficient models running locally on device is pretty big, maybe even the biggest that exists right now [iOS, Android, and macOS are pretty easy to monetize with low-cost models that are useful].
I can assure you of that, and you can do it on even 4x RTX 3090 [it won't be fast, but you'll get there :)].
Years behind what? Table stakes for what? There is much more to ML than the latest transformer and diffusion models. While those get the attention, the research outside that space dominates in volume.
There is tons of value to be had from smaller models. Even some state-of-the-art results can be obtained on a relatively small set of commodity GPUs. Not everything is GPT-scale.
Isn't a key selling point of the latest, hottest model, the one on the front page of Hacker News multiple times right now, the fact that it fits on consumer-grade GPUs? Surely some of the interesting ideas it's spawning right now are people doing transfer learning on GPUs whose names don't end in "100", don't you think?
You know there's a huge difference between training the original model and transfer learning to apply it to a new use case, right? Saying people are years behind if they think their work is only worth something with 8x A100 pods is pretty ignorant of how most applications get built. Not everyone's trying to design novel model architectures, nor should they.