For some reason they focus on the inference, which is the computationally cheap part. If you're working on ML (as opposed to deploying someone else's ML) then almost all of your workload is training, not inference.
Agreed that there are workloads where inference is not expensive, but it's really workload dependent. For applications that run inference over large amounts of data in the computer vision space, inference ends up being a dominant portion of the spend.
The way I see it, every new data point (on which production inference runs once) generally becomes part of the data set used to train every subsequent model, so the same data point gets processed many more times in training than in inference, and training unavoidably ends up taking more effort.
Perhaps I'm a bit biased towards all kinds of self-supervised or human-in-the-loop or semi-supervised models, but the notion of discarding large amounts of good domain-specific data that get processed only for inference and not used for training afterward feels a bit foreign to me, because you usually can extract an advantage from it. But perhaps that's the difference between data-starved domains and overwhelming-data domains?
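To put a rough number on that intuition, here's a back-of-envelope sketch (every count below is a made-up illustrative assumption, not a measurement):

    # Back-of-envelope: compute spent on one data point over its lifetime.
    # Every number here is an illustrative assumption, not a measurement.
    forward_cost = 1.0             # one inference (forward pass), normalized
    backward_multiplier = 3.0      # a training step ~ forward + backward ~= 3x a forward pass
    epochs_per_training_run = 10   # times the point is seen in one training run
    future_retrains = 5            # later models trained on a dataset containing it

    inference_cost = forward_cost  # the point is scored once in production
    training_cost = forward_cost * backward_multiplier * epochs_per_training_run * future_retrains

    print(f"training/inference ratio per data point: {training_cost / inference_cost:.0f}x")
    # -> 150x under these assumptions; the ratio only flips if most production
    #    traffic is never folded back into the training set.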
What you say re saving all data is the ideal. I'd add a couple of caveats. One is that in many fields you get lots of redundant data that adds nothing to training (for example, if an image classifier is looking for some rare class, you can be drowning in images of the majority class), or lots of data that is already unambiguously and correctly classified; some kind of active learning can tell you what is worth keeping (rough sketch below).
The other is that for various reasons the customer doesn't want to share their data (or at least doesn't want sharing built into the inference system), so even if you'd like to have everything they record, it's just not available. Obviously something to discourage, but it seems common.
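On the active-learning point, here's a minimal sketch of the kind of filter I mean, keeping only examples the model is uncertain about; `model`, the data stream, and the threshold are hypothetical placeholders you'd replace and tune:

    import numpy as np

    def worth_keeping(probs: np.ndarray, entropy_threshold: float = 0.5) -> bool:
        """Keep an example only if the model is uncertain about it.

        probs: predicted class probabilities for one example, shape (num_classes,).
        The threshold is illustrative; in practice you'd tune it against your
        labeling budget and class balance.
        """
        entropy = -np.sum(probs * np.log(probs + 1e-12))
        return entropy > entropy_threshold

    # Usage sketch (`model` and `image_stream` are hypothetical placeholders):
    # kept = [x for x in image_stream if worth_keeping(model.predict_proba(x))]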
There's one piece of the puzzle you're missing: field-deployed devices.
If I play chess on my computer, the games I play locally won't hit the Stockfish models. When I use the feature on my phone that allows me to copy text from a picture, it won't phone home with all the frames.
Yup, exactly. It's a good point that for self-supervised workloads, the training set can become arbitrarily large. For a lot of other workloads in the vision space, most data needs to be labeled before it can be used for training.
We have different servers for each. But the split is usually 80%/20% for inference/training. As our product grows in usage the 80% number is steadily increasing.
That isn't because we aren't training that often - we are almost always training many new models. It is just that inference is so computationally expensive!
Are you training new models from scratch or just fine tuning LLMs? I'm from the CV side and we tend to train stuff from scratch because we're still highly focused on finding new architectures and how to scale. The NLP people I know tend to use LLMs and existing checkpoints so their experiments tend to be a lot cheaper.
Not that anyone should think any aspect (training or inference) is cheap.
Maybe from the researcher or data scientist's perspective. But if you have a product that uses ML and inference doesn't dominate training, you're doing it wrong.
Think Google: Every time you search, some model somewhere gets invoked, and the aggregate inference cost would dwarf even very large training costs if you have billions of searches.
Marketing blogspam like this is always targeting big (not Google, but big) companies hoping to divert their big IT budgets to their coffers: "You have X million queries to your model every day. Imagine if we billed you per-request, but scaled the price so in aggregate it's slightly cheaper than your current spending."
People who are training-constrained are early-stage (i.e., correlated with not having money), and then they'd need to buy an entirely separate set of GPUs to support you (e.g., T4s are good for inference, but they need V100s for training). So they choose to ignore you entirely.
Yeah we did lots of things like this at Instagram. Can be very brittle and dangerous though to share any caching amongst multiple users. If you work at Facebook you can search for some SEVs related to this lol
If you are training models that are intended to be used in production at scale then training is dirt cheap compared to inference. There is a reason why Google focused on inference first with their TPUs even though Google does a lot of ML training.
With more customers usually the revenue and profit grow, then the team becomes larger, wants to perform more experiments, spends more on training and so on. Inference is just so computationally cheap compared to training.
That's what I've seen in my experience, but I concur that there might be cases where the ML is a more-or-less solved problem for a very large customer base, and there inference dominates. I've rarely seen it happen, but other people are sharing scenarios where it happens frequently. So I guess it massively depends on the domain.
More to the point, you don't do training and inference in the same program, so they don't have to be on the same hardware or in the same machine. It's two separate problems with separate hardware solutions.
We did a big analysis of this a few years back. We ended up using a big spot-instance cluster of CPU machines for our inference cluster. Much more consistently available than spot GPUs, at greater scale, and at a better price per inference (at least at the time). Scaled well to many billions of inferences. Of course, compare cost per inference on your models to make sure the logic applies. Article on how it worked: https://www.freecodecamp.org/news/ml-armada-running-tens-of-...
Training was always GPUs (for speed), non-spot-instance (for reliability), and cloud based (for infinite parallelism). Training work tended to be chunky, never made sense to build servers in house that would be idle some of the time, and queued at other times.
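To make the "compare cost per inference on your models" step concrete: the comparison boils down to price per hour divided by sustained throughput. A tiny helper, where the instance names, prices, and throughputs are placeholders you'd fill in from your own benchmarks and current spot pricing:

    def cost_per_million(price_per_hour: float, inferences_per_second: float) -> float:
        """Dollars per million inferences for one instance type."""
        inferences_per_hour = inferences_per_second * 3600
        return price_per_hour / inferences_per_hour * 1_000_000

    # Placeholder numbers -- substitute your own benchmark throughput and spot prices.
    candidates = {
        "spot CPU instance": cost_per_million(price_per_hour=0.03, inferences_per_second=25),
        "spot GPU instance": cost_per_million(price_per_hour=0.30, inferences_per_second=400),
    }
    for name, cost in sorted(candidates.items(), key=lambda kv: kv[1]):
        print(f"{name}: ${cost:.2f} per million inferences")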
What cloud is even remotely worth it over buying 20x RTX 3090, or even some Quadro, for training?
Maybe if you have a very small team and small problems, but if you have CV/video tasks and a team of more than 3 (maybe even 2) people, in-house servers are always the better choice, as you'll get your money back within 2-3 months of training compared to a cloud solution, and maybe even sooner if you wait for the RTX 4090.
And if you are a solo dev it's an even easier choice, as you can reuse your rig for other stuff when you're not training anything (for example gaming :D).
The only exception is if you get a free $100k in credits from AWS and then $100k from GCP; you can live with that for a year or even two if you stack both providers, but that's a special case and I'm not sure how easy it is to get $100k right now.
As mentioned in the comment, ML training workloads tend to be super chunky (at least in my experience). Some days we want to train 50 models, some weeks we are evaluating and don’t need any compute.
I’d rather be able to spin up 200 GPUs in parallel when needed (yes, at a premium), but ramp to 0 when not. Data scientists waiting around are more expensive than GPUs. Replacing/maintaining servers is more work/money than you expect. And for us the training data was cloud native, so transfer/privacy/security is easier; nothing on prem, data scientists can design models without having access to raw data, etc.
If you are a cloud-only company then for sure it is easier, but it still won't be cheaper, just more convenient to use.
If the data science team is very big, probably the "best" solution without unlimited money is to run locally and go cloud [at a premium] when you don't have free resources for your teams. (That was the case when I was working at a pretty big EU bank, though it wasn't "true" deep learning yet [about 4-5 years ago].)
You have a good point. I think for small enough workloads self-managing instances on-prem is more cost-effective. There is a simplicity gain in being able to scale instances up and down in the cloud, but it may not make sense if you can self-manage without too much work.
You are years behind if you think you're training a model worth anything on consumer-grade GPUs. Table stakes these days are 8x A100 pods, and lots of them. Luckily you can just get DGX pods so you don't have to build racks, but for many orgs just renting the pods is much cheaper.
Ah yes, because there is only one way to do deep learning, and it is of course stacking models so large they aren't useful outside pods of GPUs, and that is surely the way to go if you want to make money (from VCs, of course, because you won't have many users willing to pay enough for you to ever break even, as with OpenAI and the other big model providers; maybe you can get some money/sponsorship from a state or university).
The market for small, efficient models running locally on device is pretty big, maybe even the biggest that exists right now [iOS, Android and macOS are pretty easy to monetize with low-cost models that are useful].
I can assure you of that, and you can do it on even 4x RTX 3090 [it won't be fast but you'll get there :)].
Years behind what? Table stakes for what? There is much more to ML than the latest transformer and diffusion models. While those get the attention the amount of research not in that space dominates.
There is tons of value to be had from smaller models. Even some state of the art results can be obtained on a relatively small set of commodity GPUs. Not everything is GPT-scale.
Isn't a key selling point of the latest, hottest model that's on the front page of Hacker News multiple times right now, the fact that it fits on consumer-grade GPUs? Surely some of the interesting ideas it's spawning right now are people doing transfer learning on GPUs that don't end in "100", don't you think?
You know there's a huge difference between training the original model and transfer learning to apply it to a new use case, right? Saying people are years behind unless their work runs on 8x A100 pods is pretty ignorant of how most applications get built. Not everyone's trying to design novel model architectures, nor should they.
Disclaimer: I'm the Cofounder / CEO at Exafunction
That's a great point. We'll be addressing this in an upcoming post as well.
We've served workloads that run entirely on spot GPUs where it makes sense, since a small number of spot GPUs can make up for a large amount of spot CPU capacity. The best of all worlds is if you can manage both spot and on-demand instances (with a preference towards spot instances). Also, for latency-sensitive workloads, running on spot instances or CPUs is sometimes not an option.
I could definitely see cases where it makes sense to run on spot CPUs though.
Perhaps, but I think "disclaimer" in this context is just shorthand, since the disclosure carries with it the implicit disclaimer of "so the things I'm saying are subconsciously influenced by the fact that they could potentially make me money".
We did a similar analysis for GCP. Preemptibles/spot were the way to go with inference. CPU performance was also faster for our scaled inference workloads.
Times change though, we’re about to conduct the same analysis over again, with latest models better architected for accelerators.
For small-scale transformer CPU inference you can use, e.g., Fabrice Bellard's https://bellard.org/libnc/
Similarly, for small-scale convolutional CPU inference, where you only need to do maybe 20 ResNet-50 inferences (batch size 1) per second per CPU (cloud CPUs cost $0.015 per hour), you can use inference engines designed for this purpose, e.g., https://NN-512.com
You can expect about 2x the performance of TensorFlow or PyTorch.
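Back-of-envelope from those numbers: 20 inferences per second is 72,000 per hour, so at $0.015/hour that works out to roughly $0.2 per million ResNet-50 inferences per CPU.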
Is there a thing that Fabrice Bellard hasn't built? I had no idea that he was interested in something like machine learning, but I guess I shouldn't have been surprised because he has built every tool that I use.
The GeForce driver EULA doesn't allow it to be used in servers or something like that, so clouds all have to use the more expensive professional cards.
I empathize a bit with the cloud providers as they have to upgrade their data centers every few years with new GPU instances and it's hard for them to anticipate demand.
But if you can easily use every trick in the book (CPU version of the model, autoscaling to zero, model compilation, keeping inference in your own VPC, using spot instances, etc.) then it's usually still worth it.
The HPC crowd is not able to add GPUs, that I know of. The deep learning family of algorithms does kick butt for lots of kinds of problems and data, though I will argue that DL is NOT the only game in town, despite what you often read here.
In what context? HPC and certain code bases have been effectively leveraging heterogeneous CPU/GPU workloads for a variety of applications for quite a while. I know of some doing so in at least 2009, and plenty of prior art was already there by that point; it's just a specific time I happen to remember.
ok - the academic study in front of me dated 2020 says "no" but it is non-US researchers, public science. I have no reason to believe one way or the other, but I literally read this today.
Reading again, it seems this paper calls HPC with GPUs by a slightly different name, "GPGPU", and lists that research activity separately, so I didn't see it as HPC. Basically what I wrote is not accurate. Got it.
I think TPU is the way to go for ML, be it training or inference.
We're using GPUs (some contain a TPU-like block inside) due to 'historical reasons'. With a vector unit (x86 AVX, ARM SVE, RISC-V RVV) as part of the host CPU, either putting a TPU on a separate die of a chiplet or just putting it on a PCIe card will handle the heavy-lifting ML jobs fine. It should be much cheaper than the GPU model for ML nowadays, unless you are both a PC gamer and an ML engineer.
It's true that we were initially using GPUs mostly for historical reasons, but over the last several years modern GPUs have been optimized for ML as much as anything else. If you read Nvidia's marketing documents, they talk constantly about ML. The A100 is about as good as, if not better than, the TPUv4 in terms of raw performance on ML workloads. The A100 can do 312 bf16 TFLOPs and costs $0.88/hr on Google Cloud [0] whereas the TPUv4 can do 275 bf16 TFLOPs and costs $0.97/hr on Google Cloud [1] [2]. The A100 is also generally speaking easier to program: it's supported by more frameworks and can perform more operations. The TPUv4 is in my understanding still worth it if you like JAX and/or you're doing lots of networking though.
WRT putting a TPU on a separate die -- this has been done for several years in the mobile space: Apple Neural Engine for iPhones, TPU (not same as server TPU) on Pixel, SNPE on Qualcomm, etc.
[2] this is somewhat unfair, because the GPU pricing number is for just the GPU and not the host it runs on, whereas the TPU pricing number (for TPU VMs) includes the host it runs on. If you include the price GCP charges for the host, preemptible A100s are about $1.20/hr. Why does Google make GPUs look cheaper than TPUs when they're not? Your guess is as good as mine.
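Back-of-envelope from the figures above, in bf16 TFLOPs per dollar-hour: the A100 comes out to 312/0.88 ≈ 355 for the GPU alone, or 312/1.20 ≈ 260 once you include the host, versus 275/0.97 ≈ 284 for the TPUv4, so with the host included the TPUv4 actually edges ahead on raw price-performance.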
This also very much depends on the inference use case / context. For example, I work in deep learning on digital pathology, where images can be up to 100,000x100,000 pixels in size and inference needs GPUs as it's just way too slow otherwise.
Not related to the article, but how would one begin to become smart on optimizing GPU workloads? I've been charged with deploying an application that is a mixture of heuristic search and inference, that has been exclusively single-user to this point.
I'm sure every little thing I've discovered (e.g. measuring cpu/gpu workloads, trying to multiplex access to the gpu, etc) was probably covered in somebody's grad school notes 12 years ago, but I haven't found a source of info on the topic.
Let's just take the topic of measuring GPU usage. This alone is quite tricky -- tools like nvidia-smi will show full GPU utilization even if not all SMs are running. And also the workload may change behavior over time, if for instance inputs to transformers got longer over time. And then it gets even more complicated to measure when considering optimizations like dynamic batching. I think if you peek into some ML Ops communities you can get a flavor of these nuances, but not sure if there are good exhaustive guides around right now.
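As a minimal starting point for the measurement piece, here's a sketch that polls NVML via the pynvml bindings (assuming the package is installed); note the 'gpu' figure only reports the fraction of time some kernel was running, not how many SMs it kept busy, which is exactly the trap mentioned above:

    import time
    import pynvml  # NVML Python bindings (pip install nvidia-ml-py); assumed available

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    for _ in range(10):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # util.gpu = % of the sample period in which at least one kernel was running;
        # it says nothing about how many SMs that kernel actually occupied.
        print(f"kernel-active: {util.gpu}%  memory used: {mem.used / 2**30:.1f} GiB")
        time.sleep(1)

    pynvml.nvmlShutdown()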
rCUDA is super cool! One of the issues though is that a lot of the common model frameworks are not supported, and a new release has not come out in a while.
Yeah, we should have a public release very soon for people to deploy internally. We will have support for all the commonly-used frameworks and different versions.
What a clickbaity article. It's an interesting discussion of GPU multiplexing for ML inference merged together with a sales pitch, but the clickbait title made me hate the article's bait and switch. This wasn't even an example of Betteridge's law, just a completely misleading headline.
Perhaps it's been mentioned before, but I do find it curious how often crypto mining was lambasted for contributing to climate change, yet I haven't seen anybody bat an eye at a fairly similar amount of compute power used for ML applications. Makes me wonder.
Game-playing (e.g. AlphaGo) is computationally hard, but the rules are immutable, target functions (e.g., heuristics) don't change much, and you can generate arbitrarily sized clean data sets (play more games). On these problems, ML-scaling approaches work very well. For business problems where the value of data decays rapidly, though, you probably don't need the power of a deep or complex neural net with millions of parameters, and expensive specialty hardware probably isn't worth it.
Not only can these deep learning models be tricked by a single pixel or confused by malicious input and become useless, but deep learning training, retraining, and fine-tuning on GPUs and TPUs, all running in data centers, contribute significantly to burning up the planet and driving up costs, while the models are used for nothing but surveillance of our own data.
If a model doesn't work, it has to be retrained on new data again, and after years of deep learning existing there are still no efficient alternatives to this energy waste other than using more GPUs, TPUs, etc., emitting more CO2.
A complete waste of resources and energy. Therefore it is not worth it at all.
Why so negative? It's a waste only if the value provided is less than the cost. You can't decide that with only the cost.
As humans we have our own adversarial examples, we get tired, we get sloppy, we might be even more biased than a calibrated model and always much more expensive.
> Why so negative? It's a waste only if the value provided is less than the cost.
It is entirely a waste: it just takes an invalid input to trick these models, they mess up easily, and it's even worse when biases are involved. Thus the value is nullified.
And once that model breaks and doesn't work, what is the solution? More retraining on new data? Even with that, like I said, there are ZERO efficient alternatives, and the cost outweighs the benefits.