For some reason they focus on the inference, which is the computationally cheap part. If you're working on ML (as opposed to deploying someone else's ML) then almost all of your workload is training, not inference.
Agreed that there are workloads where inference is not expensive, but it's really workload dependent. For applications that run inference over large amounts of data in the computer vision space, inference ends up being a dominant portion of the spend.
The way I see it, every new data point (on which production inference runs once) generally becomes part of the data set used to train every subsequent model, so the same data point gets processed many more times in training than in inference, and training unavoidably ends up taking more effort.
Perhaps I'm a bit biased towards all kinds of self-supervised or human-in-the-loop or semi-supervised models, but the notion of discarding large amounts of good domain-specific data that get processed only for inference and not used for training afterward feels a bit foreign to me, because you usually can extract an advantage from it. But perhaps that's the difference between data-starved domains and overwhelming-data domains?
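To put a rough number on that intuition, here's a back-of-envelope sketch (every count below is a made-up illustrative assumption, not a measurement):

    # Back-of-envelope: compute spent on one data point over its lifetime.
    # Every number here is an illustrative assumption, not a measurement.
    forward_cost = 1.0             # one inference (forward pass), normalized
    backward_multiplier = 3.0      # a training step ~ forward + backward ~= 3x a forward pass
    epochs_per_training_run = 10   # times the point is seen in one training run
    future_retrains = 5            # later models trained on a dataset containing it

    inference_cost = forward_cost  # the point is scored once in production
    training_cost = forward_cost * backward_multiplier * epochs_per_training_run * future_retrains

    print(f"training/inference ratio per data point: {training_cost / inference_cost:.0f}x")
    # -> 150x under these assumptions; the ratio only flips if most production
    #    traffic is never folded back into the training set.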
What you say re saving all data is the ideal. I'd add a couple of caveats. One is that in many fields you get lots of redundant data that adds nothing to training (for example, if an image classifier is looking for some rare class, you can be drowning in images of the majority class), or lots of data that is already unambiguously and correctly classified; some kind of active learning can tell you what is worth keeping (rough sketch below).
The other is that for various reasons the customer doesn't want to share their data (or at least doesn't want sharing built into the inference system), so even if you'd like to have everything they record, it's just not available. Obviously something to discourage, but it seems common.
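On the active-learning point, here's a minimal sketch of the kind of filter I mean, keeping only examples the model is uncertain about; `model`, the data stream, and the threshold are hypothetical placeholders you'd replace and tune:

    import numpy as np

    def worth_keeping(probs: np.ndarray, entropy_threshold: float = 0.5) -> bool:
        """Keep an example only if the model is uncertain about it.

        probs: predicted class probabilities for one example, shape (num_classes,).
        The threshold is illustrative; in practice you'd tune it against your
        labeling budget and class balance.
        """
        entropy = -np.sum(probs * np.log(probs + 1e-12))
        return entropy > entropy_threshold

    # Usage sketch (`model` and `image_stream` are hypothetical placeholders):
    # kept = [x for x in image_stream if worth_keeping(model.predict_proba(x))]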
There's one piece of the puzzle you're missing: field-deployed devices.
If I play chess on my computer, the games I play locally won't hit the Stockfish models. When I use the feature on my phone that allows me to copy text from a picture, it won't phone home with all the frames.
Yup, exactly. It's a good point that for self-supervised workloads, the training set can become arbitrarily large. For a lot of other workloads in the vision space, most data needs to be labeled before it can be used for training.
We have different servers for each. But the split is usually 80%/20% for inference/training. As our product grows in usage the 80% number is steadily increasing.
That isn't because we aren't training that often - we are almost always training many new models. It is just that inference is so computationally expensive!
Are you training new models from scratch or just fine tuning LLMs? I'm from the CV side and we tend to train stuff from scratch because we're still highly focused on finding new architectures and how to scale. The NLP people I know tend to use LLMs and existing checkpoints so their experiments tend to be a lot cheaper.
Not that anyone should think any aspect (training or inference) is cheap.
Maybe from the researcher or data scientist's perspective. But if you have a product that uses ML and inference doesn't dominate training, you're doing it wrong.
Think Google: Every time you search, some model somewhere gets invoked, and the aggregate inference cost would dwarf even very large training costs if you have billions of searches.
Marketing blogspam like this is always targeting big (not Google, but big) companies hoping to divert their big IT budgets to their coffers: "You have X million queries to your model every day. Imagine if we billed you per-request, but scaled the price so in aggregate it's slightly cheaper than your current spending."
People who are training-constrained are early-stage (i.e., correlated with not having money), and then they'd need to buy an entirely separate set of GPUs to support you (e.g., T4s are good for inference, but they need V100s for training). So they choose to ignore you entirely.
Yeah we did lots of things like this at Instagram. Can be very brittle and dangerous though to share any caching amongst multiple users. If you work at Facebook you can search for some SEVs related to this lol
If you are training models that are intended to be used in production at scale then training is dirt cheap compared to inference. There is a reason why Google focused on inference first with their TPUs even though Google does a lot of ML training.
With more customers usually the revenue and profit grow, then the team becomes larger, wants to perform more experiments, spends more on training and so on. Inference is just so computationally cheap compared to training.
That's what I've seen in my experience, but I concur that there might be cases where the ML is a more-or-less solved problem for a very large customer base, and there inference dominates. I've rarely seen it happen, but other people are sharing scenarios where it happens frequently. So I guess it massively depends on the domain.
More to the point, you don't do training and inference in the same program, so they don't have to be on the same hardware or in the same machine. It's two separate problems with separate hardware solutions.
We did a big analysis of this a few years back. We ended up using a big spot-instance cluster of CPU machines for our inference cluster. Much more consistently available than spot GPUs, at greater scale, and at a better price per inference (at least at the time). Scaled well to many billions of inferences. Of course, compare cost per inference on your models to make sure the logic applies. Article on how it worked: https://www.freecodecamp.org/news/ml-armada-running-tens-of-...
Training was always GPUs (for speed), non-spot-instance (for reliability), and cloud based (for infinite parallelism). Training work tended to be chunky, never made sense to build servers in house that would be idle some of the time, and queued at other times.
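To make the "compare cost per inference on your models" step concrete: the comparison boils down to price per hour divided by sustained throughput. A tiny helper, where the instance names, prices, and throughputs are placeholders you'd fill in from your own benchmarks and current spot pricing:

    def cost_per_million(price_per_hour: float, inferences_per_second: float) -> float:
        """Dollars per million inferences for one instance type."""
        inferences_per_hour = inferences_per_second * 3600
        return price_per_hour / inferences_per_hour * 1_000_000

    # Placeholder numbers -- substitute your own benchmark throughput and spot prices.
    candidates = {
        "spot CPU instance": cost_per_million(price_per_hour=0.03, inferences_per_second=25),
        "spot GPU instance": cost_per_million(price_per_hour=0.30, inferences_per_second=400),
    }
    for name, cost in sorted(candidates.items(), key=lambda kv: kv[1]):
        print(f"{name}: ${cost:.2f} per million inferences")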
What cloud is even remotely worth it over buying 20x RTX 3090, or even some Quadro, for training?
Maybe if you have a very small team and small problems, but if you have CV/video tasks and a team of more than 3 (maybe even 2) people, in-house servers are always the better choice, as you'll get your money back within 2-3 months of training compared to a cloud solution, and maybe even sooner if you wait for the RTX 4090.
And if you are a solo dev it's an even easier choice, as you can reuse your rig for other stuff when you're not training anything (for example gaming :D).
The only exception is if you get a free $100k in credits from AWS and then $100k from GCP; you can live with that for a year or even two if you stack both providers, but that's a special case and I'm not sure how easy it is to get $100k right now.
As mentioned in the comment, ML training workloads tend to be super chunky (at least in my experience). Some days we want to train 50 models, some weeks we are evaluating and don’t need any compute.
I’d rather be able to spin up 200 GPUs in parallel when needed (yes, at a premium), but ramp to 0 when not. Data scientists waiting around are more expensive than GPUs. Replacing/maintaining servers is more work/money than you expect. And for us the training data was cloud native, so transfer/privacy/security is easier; nothing on prem, data scientists can design models without having access to raw data, etc.
If you are a cloud-only company then for sure it is easier, but it still won't be cheaper, just more convenient to use.
If the data science team is very big, probably the "best" solution without unlimited money is to run locally and go cloud [at a premium] when you don't have free resources for your teams. (That was the case when I was working at a pretty big EU bank, though it wasn't "true" deep learning yet [about 4-5 years ago].)
You have a good point. I think for small enough workloads self-managing instances on-prem is more cost-effective. There is a simplicity gain in being able to scale instances up and down in the cloud, but it may not make sense if you can self-manage without too much work.
You are years behind if you think you're training a model worth anything on consumer-grade GPUs. Table stakes these days are 8x A100 pods, and lots of them. Luckily you can just get DGX pods so you don't have to build racks, but for many orgs just renting the pods is much cheaper.
Ah yes, because there is only one way to do deep learning, and it is of course stacking models so large they aren't useful outside pods of GPUs, and that is surely the way to go if you want to make money (from VCs, of course, because you won't have many users willing to pay enough for you to ever break even, as with OpenAI and the other big model providers; maybe you can get some money/sponsorship from a state or university).
The market for small, efficient models running locally on device is pretty big, maybe even the biggest that exists right now [iOS, Android and macOS are pretty easy to monetize with low-cost models that are useful].
I can assure you of that, and you can do it on even 4x RTX 3090 [it won't be fast but you'll get there :)].
Years behind what? Table stakes for what? There is much more to ML than the latest transformer and diffusion models. While those get the attention the amount of research not in that space dominates.
There is tons of value to be had from smaller models. Even some state of the art results can be obtained on a relatively small set of commodity GPUs. Not everything is GPT-scale.
Isn't a key selling point of the latest, hottest model that's on the front page of Hacker News multiple times right now, the fact that it fits on consumer-grade GPUs? Surely some of the interesting ideas it's spawning right now are people doing transfer learning on GPUs that don't end in "100", don't you think?
You know there's a huge difference between training the original model and transfer learning to apply it to a new use case, right? Saying people are years behind unless their work runs on 8x A100 pods is pretty ignorant of how most applications get built. Not everyone's trying to design novel model architectures, nor should they.
Disclaimer: I'm the Cofounder / CEO at Exafunction
That's a great point. We'll be addressing this in an upcoming post as well.
We've served workloads that run entirely on spot GPUs where it makes sense, since a small number of spot GPUs can make up for a large amount of spot CPU capacity. The best of all worlds is if you can manage both spot and on-demand instances (with a preference towards spot instances). Also, for latency-sensitive workloads, running on spot instances or CPUs is sometimes not an option.
I could definitely see cases where it makes sense to run on spot CPUs though.
Perhaps, but I think "disclaimer" in this context is just shorthand, since the disclosure carries with it the implicit disclaimer of "so the things I'm saying are subconsciously influenced by the fact that they could potentially make me money".
We did a similar analysis for GCP. Preemptibles/spot were the way to go with inference. CPU performance was also faster for our scaled inference workloads.
Times change though, we’re about to conduct the same analysis over again, with latest models better architected for accelerators.
For small-scale transformer CPU inference you can use, e.g., Fabrice Bellard's https://bellard.org/libnc/
Similarly, for small-scale convolutional CPU inference, where you only need to do maybe 20 ResNet-50 inferences (batch size 1) per second per CPU (cloud CPUs cost $0.015 per hour), you can use inference engines designed for this purpose, e.g., https://NN-512.com
You can expect about 2x the performance of TensorFlow or PyTorch.
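Back-of-envelope from those numbers: 20 inferences per second is 72,000 per hour, so at $0.015/hour that works out to roughly $0.2 per million ResNet-50 inferences per CPU.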
Is there a thing that Fabrice Bellard hasn't built? I had no idea that he was interested in something like machine learning, but I guess I shouldn't have been surprised because he has built every tool that I use.
The GeForce driver EULA doesn't allow it to be used in servers or something like that, so clouds all have to use the more expensive professional cards.
I empathize a bit with the cloud providers as they have to upgrade their data centers every few years with new GPU instances and it's hard for them to anticipate demand.
But if you can easily use every trick in the book (CPU version of the model, autoscaling to zero, model compilation, keeping inference in your own VPC, using spot instances, etc.) then it's usually still worth it.
The HPC crowd is not able to add GPUs, that I know of. The deep learning family of algorithms does kick butt for lots of kinds of problems and data, though I will argue that DL is NOT the only game in town, despite what you often read here.
In what context? HPC and certain code bases have been effectively leveraging heterogeneous CPU/GPU workloads for a variety of applications for quite a while. I know of some doing so in at least 2009, and plenty of prior art was already there by that point; it's just a specific time I happen to remember.
ok - the academic study in front of me dated 2020 says "no" but it is non-US researchers, public science. I have no reason to believe one way or the other, but I literally read this today.
Reading again, it seems this paper calls HPC with GPUs by a slightly different name, "GPGPU", and lists that research activity separately, so I didn't see it as HPC. Basically what I wrote is not accurate. Got it.
I think TPU is the way to go for ML, be it training or inference.
We're using GPUs (some contain a TPU-like block inside) due to 'historical reasons'. With a vector unit (x86 AVX, ARM SVE, RISC-V RVV) as part of the host CPU, either putting a TPU on a separate die of a chiplet or just putting it on a PCIe card will handle the heavy-lifting ML jobs fine. It should be much cheaper than the GPU model for ML nowadays, unless you are both a PC gamer and an ML engineer.
It's true that we were initially using GPUs mostly for historical reasons, but over the last several years modern GPUs have been optimized for ML as much as anything else. If you read Nvidia's marketing documents, they talk constantly about ML. The A100 is about as good as, if not better than, the TPUv4 in terms of raw performance on ML workloads. The A100 can do 312 bf16 TFLOPs and costs $0.88/hr on Google Cloud [0] whereas the TPUv4 can do 275 bf16 TFLOPs and costs $0.97/hr on Google Cloud [1] [2]. The A100 is also generally speaking easier to program: it's supported by more frameworks and can perform more operations. The TPUv4 is in my understanding still worth it if you like JAX and/or you're doing lots of networking though.
WRT putting a TPU on a separate die -- this has been done for several years in the mobile space: Apple Neural Engine for iPhones, TPU (not same as server TPU) on Pixel, SNPE on Qualcomm, etc.
[2] this is somewhat unfair, because the GPU pricing number is for just the GPU and not the host it runs on, whereas the TPU pricing number (for TPU VMs) includes the host it runs on. If you include the price GCP charges for the host, preemptible A100s are about $1.20/hr. Why does Google make GPUs look cheaper than TPUs when they're not? Your guess is as good as mine.
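Back-of-envelope from the figures above, in bf16 TFLOPs per dollar-hour: the A100 comes out to 312/0.88 ≈ 355 for the GPU alone, or 312/1.20 ≈ 260 once you include the host, versus 275/0.97 ≈ 284 for the TPUv4, so with the host included the TPUv4 actually edges ahead on raw price-performance.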
This also very much depends on the inference use case / context. For example, I work in deep learning on digital pathology, where images can be up to 100,000x100,000 pixels in size and inference needs GPUs as it's just way too slow otherwise.
Not related to the article, but how would one begin to become smart on optimizing GPU workloads? I've been charged with deploying an application that is a mixture of heuristic search and inference, that has been exclusively single-user to this point.
I'm sure every little thing I've discovered (e.g. measuring cpu/gpu workloads, trying to multiplex access to the gpu, etc) was probably covered in somebody's grad school notes 12 years ago, but I haven't found a source of info on the topic.
Let's just take the topic of measuring GPU usage. This alone is quite tricky -- tools like nvidia-smi will show full GPU utilization even if not all SMs are running. And also the workload may change behavior over time, if for instance inputs to transformers got longer over time. And then it gets even more complicated to measure when considering optimizations like dynamic batching. I think if you peek into some ML Ops communities you can get a flavor of these nuances, but not sure if there are good exhaustive guides around right now.
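As a minimal starting point for the measurement piece, here's a sketch that polls NVML via the pynvml bindings (assuming the package is installed); note the 'gpu' figure only reports the fraction of time some kernel was running, not how many SMs it kept busy, which is exactly the trap mentioned above:

    import time
    import pynvml  # NVML Python bindings (pip install nvidia-ml-py); assumed available

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    for _ in range(10):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # util.gpu = % of the sample period in which at least one kernel was running;
        # it says nothing about how many SMs that kernel actually occupied.
        print(f"kernel-active: {util.gpu}%  memory used: {mem.used / 2**30:.1f} GiB")
        time.sleep(1)

    pynvml.nvmlShutdown()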
rCUDA is super cool! One of the issues though is that a lot of the common model frameworks are not supported, and a new release has not come out in a while.
Yeah, we should have a public release very soon for people to deploy internally. We will have support for all the commonly-used frameworks and different versions.
What a clickbaity article. It's an interesting discussion of GPU multiplexing for ML inference merged together with a sales pitch, but the clickbait title made me hate the article's bait and switch. This wasn't even an example of Betteridge's law, just a completely misleading headline.
Perhaps it's been mentioned before, but I do find it curious how often crypto mining was lambasted for contributing to climate change, yet I haven't seen anybody bat an eye at a fairly similar amount of compute power used for ML applications. Makes me wonder.
Game-playing (e.g. AlphaGo) is computationally hard, but the rules are immutable, target functions (e.g., heuristics) don't change much, and you can generate arbitrarily sized clean data sets (play more games). On these problems, ML-scaling approaches work very well. For business problems where the value of data decays rapidly, though, you probably don't need the power of a deep or complex neural net with millions of parameters, and expensive specialty hardware probably isn't worth it.
Not only can these deep learning models be tricked by a single pixel or confused by malicious input and become useless, but deep learning training, retraining, and fine-tuning on GPUs and TPUs, all running in data centers, contribute significantly to burning up the planet and driving up costs, while the models are used for nothing but surveillance of our own data.
If a model doesn't work, it has to be retrained on new data again, and after years of deep learning existing there are still no efficient alternatives to this energy waste other than using more GPUs, TPUs, etc., emitting more CO2.
A complete waste of resources and energy. Therefore it is not worth it at all.
Why so negative? It's a waste only if the value provided is less than the cost. You can't decide that with only the cost.
As humans we have our own adversarial examples, we get tired, we get sloppy, we might be even more biased than a calibrated model and always much more expensive.
> Why so negative? It's a waste only if the value provided is less than the cost.
It is entirely a waste: it just takes an invalid input to trick these models, they mess up easily, and it's even worse when biases are involved. Thus the value is nullified.
And once that model breaks and doesn't work, what is the solution? More retraining on new data? Even with that, like I said, there are ZERO efficient alternatives, and the cost outweighs the benefits.