How to train large deep learning models as a startup (assemblyai.com)
273 points by dylanbfox on Oct 7, 2021 | 81 comments


Several hints here are severely outdated.

For instance, never train a model in end-to-end FP16. Use mixed precision, either via native TF/PyTorch or as a freebie when using TF32 on A100s. This’ll ensure that only suitable ops are run with lower precision; no need to fiddle with anything. Also, PyTorch DDP in multi-node regimes hasn’t been slower or less efficient than Horovod in ages.
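
As a quick illustration, here's the standard PyTorch AMP pattern (the toy model and data are just placeholders):

    import torch
    import torch.nn as nn
    from torch.cuda.amp import autocast, GradScaler

    model = nn.Linear(512, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    scaler = GradScaler()  # scales the loss so FP16 gradients don't underflow

    for _ in range(100):
        x = torch.randn(32, 512, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")
        optimizer.zero_grad()
        with autocast():                # eligible ops run in FP16, the rest stay FP32
            loss = criterion(model(x), y)
        scaler.scale(loss).backward()   # backward pass on the scaled loss
        scaler.step(optimizer)          # unscale gradients, then take the optimizer step
        scaler.update()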

Finally, buying a local cluster of TITAN Xs is an outright weird recommendation for massive models. VRAM limitations alone make this a losing proposition.


Hi there - OP here - thanks for reading!

This blog is more of an intro to a few high-level concepts (multi-GPU and multi-node training, FP32 vs FP16, buying hardware and dedicated machines vs AWS/GCP, etc.) for startups that are early in their deep learning journey and might need a nudge in the right direction.

If you're looking for a deep dive into the best GPUs to buy (cost/perf, etc.), the link in the comment below gives a pretty good overview.

PS - I can send you some benchmarks we did that show (at least for us) Horovod is ~10% faster than DDP for multi-node training FWIW. Email is in my profile!


> Finally, buying a local cluster of TITAN Xs is an outright weird recommendation for massive models. VRAM limitations alone make this a losing proposition.

Do you have an alternative recommendation?


You can check out some of the benchmarks here: https://lambdalabs.com/blog/nvidia-rtx-a6000-benchmarks/

It provides some modern, real-world deep learning benchmarks using the mixed precision (TF32) that the GP was referring to.


I definitely enjoyed reading your article!

Did you play around with any AI-specific accelerators (e.g. TPUs)?

Looking at some basic cost analysis from a stranger on the Internet - https://medium.com/bigdatarepublic/cost-comparison-of-deep-l... - you can probably get a decent price reduction in training, especially using preemptible instances (and perhaps a better pricing contract with Google/AWS)

It's kind of crazy how the shortage of GPUs is affecting pricing on physical devices. My RTX Titan I bought in 2019 for $2,499 runs almost $5k on Amazon and is in short supply. The Titan V you linked (although I think there's a typo, because you referred to it as a Titan X) is an option - but it is still super overpriced for its performance. Of course, this will probably settle down in the next year or two, and by then there will be new GPUs that are ~2-4x flop/$ compared to the V100/A100.


At these sizes, TPUs would definitely be the way to go, and would likely be a lot cheaper (and potentially faster) than GPUs.


Last I checked (a year or two ago), PyTorch support for TPUs was atrocious. Has it gotten any better?


PyTorch XLA is a mature backend. In fact, several other accelerators support PyTorch by lowering from XLA.
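
For what it's worth, the basic PyTorch/XLA flow looks roughly like this (a sketch from memory, not a drop-in script):

    import torch
    import torch.nn as nn
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()                    # grab a TPU core as a torch device
    model = nn.Linear(512, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(32, 512, device=device)
    y = torch.randint(0, 10, (32,), device=device)

    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    xm.optimizer_step(optimizer, barrier=True)  # step and force the XLA graph to execute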



> How to train large deep learning models as a startup

How to train large deep learning models at a well-funded startup*

Everything described here is absolutely not affordable for bootstrappers and startups with little funding, unless the model to train is not that deep.


As a bootstrapper, I camped all night outside of Best Buy to get some 3090s.

Other tips not mentioned in the article:

1. Tune your hyperparameters on a subset of the data.

2. Validate new methods with smaller models on public datasets.

3. Fine-tune models instead of training from scratch (either public models or your previously trained ones).
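
For tip 3, a minimal fine-tuning sketch in PyTorch/torchvision (assuming an ImageNet-pretrained ResNet and a new task with 10 classes):

    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet50(pretrained=True)        # start from a public pretrained model
    for param in model.parameters():                # freeze the backbone...
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 10)  # ...and train only a new head

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)

From there you train as usual, and unfreeze more layers later if the frozen backbone isn't enough.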


Great hacks, although you have to be aware of the trade-offs:

1. If you choose the wrong subset, you'll find a non-optimal local minimum.

2. You still risk dead ends when scaling the model up, and it lengthens the time to find that out.

3. A lot of public models are trained on inaccurate datasets, so beware.

Overall you have to start somewhere though, and your points are still valid.


1. The small subset is to test that your training pipeline works and converges to near-zero loss (see the sketch after this list).

2. Sure, but for most new hacks like mixup, RandAugment, etc., the results usually transfer over. The problem with deep learning is that most new results don't replicate, so it's good to have a way to quickly validate things.

3. The lower-level features are usually pretty data-agnostic and transfer well to new tasks.
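
On point 1, the usual sanity check looks something like this (the random data and tiny model are stand-ins for whatever you actually train):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, Subset, TensorDataset

    full_data = TensorDataset(torch.randn(10000, 64), torch.randint(0, 5, (10000,)))
    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 5))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    tiny = Subset(full_data, range(64))              # just a handful of examples
    loader = DataLoader(tiny, batch_size=8, shuffle=True)

    for epoch in range(200):                         # loop over the same few samples
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    # if the loss hasn't collapsed to ~0 by now, something in the pipeline is broken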


1. Gradient descent almost always finds a non-optimal local minimum (it is not guaranteed to find a global minimum).


Isn't the current best practice to train highly over-parametrized models to zero training error? That'd be a global optimum, no?

Unless we're talking about the optimum of the test error.


If you find a zero of a non-negative function, I would call that a global minimum, yes.


Yeah, but depending on the data you might get even worse results; selecting a representative subset is really important.


Would a random sample be representative? Statistically this seems to be the case for any large N. In fact it's not clear to me that any other sample would be more representative.


Many public datasets have skewed classes, so if you sample purely at random you're not going to get a good result. And N might not be big enough anyway.
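
If the classes are skewed, a stratified sample keeps the subset's label distribution matched to the full set - e.g. with scikit-learn (the arrays here are stand-ins for your data and labels):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.randn(100000, 20)          # placeholder features
    y = np.random.randint(0, 10, 100000)     # placeholder labels

    # take a 10% subset whose class proportions mirror the full dataset
    X_small, _, y_small, _ = train_test_split(X, y, train_size=0.1,
                                              stratify=y, random_state=0)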


Check out Determined https://github.com/determined-ai/determined to help manage this kind of work at scale: Determined leverages Horovod under the hood, automatically manages cloud resources, can get you up on spot instances, T4s, etc., and will work on your local cluster as well. It gives you additional features like experiment management, scheduling, profiling, model registry, advanced hyperparameter tuning, etc.

Full disclosure: I'm a founder of the project.


Oh hey I interviewed with y'all a few years back, glad to see you're still around.


Interesting. How do you guys manage spot interruptions when training on spot instances?


Users expose their model to our Trial API (https://docs.determined.ai/latest/topic-guides/model-definit...), the base class then implements a training loop (which can be enhanced with user-supplied callbacks, metrics, etc.) that has a whole bunch of bells and whistles. Easy distributed (multi-GPU and multi-node) training, automatic checkpointing, fault tolerance, etc.

Concretely, the system is regularly taking checkpoints (which include model weights and optimizer state) and so if the spots disappear (as they do), the system has enough information to resume from where things were last checkpointed when resources become available again.
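
For anyone rolling this by hand instead, the core idea is just a checkpoint that includes optimizer state - a generic PyTorch-style sketch (not Determined's actual internals):

    import os
    import torch

    CKPT = "checkpoint.pt"

    def save_checkpoint(model, optimizer, step):
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT)

    def load_checkpoint(model, optimizer):
        if not os.path.exists(CKPT):
            return 0                                  # fresh start
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]                          # resume where the spot instance died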


Thanks for going open source!


Excellent and informative article--and a good bit of brand-building, I might say :-). One thing I'd love to see more writing about is prototyping and iterative development in these contexts--deep NNs are notoriously hard to get "right", and there seems to be a constant tension between model architecting, tuning hyperparameters, etc.--for example, you presumably don't want to have to wait a couple of weeks (and burn through thousands of dollars) seeing if one choice of hyperparameters works well for your chosen architecture.

Of course, some development practices, such as ensuring that your loss function works in a basic sense, are covered in many places. But I'd love to see more in-depth coverage of architecture development & development best practices. Does anyone know of any particularly good resources / discussions there?


This is an awesome blog post by Andrej Karpathy (the Director of AI at Tesla) about his recipe for training neural networks: https://karpathy.github.io/2019/04/25/recipe/


I would like to second this. Thanks for linking this. As someone starting out in deep learning and noticing that a lot of things are still more art than science, this seems great for avoiding some footguns!


Thank you!


If you wanted to do something like "OK Google" with AssemblyAI, would you have to transcribe everything and then process the substring "OK Google" on the application layer (and therefore incur all of the cost of listening constantly)?

It'd be cool if there was the ability to train a phrase locally on your own premises and then use that to begin the real transcription.

This probably wouldn't be super difficult to build, but I was wondering if it's available (I didn't see anything at a glance).


Great question. This is technically referred to as "Wake Word Detection". You run a really small model locally that just processes 500ms (for example) of audio at a time through a lightweight CNN or RNN. The idea here is that it's just binary classification (vs. actual speech recognition).
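
For a rough idea of how small that model can be, here's a toy binary classifier over ~500ms of 16kHz audio (all shapes and layer sizes are made up for illustration):

    import torch
    import torch.nn as nn

    class WakeWordNet(nn.Module):
        def __init__(self):
            super().__init__()
            # input: (batch, 1, 8000) = 500ms of 16kHz mono audio
            self.net = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=80, stride=16), nn.ReLU(),
                nn.Conv1d(16, 32, kernel_size=3), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                nn.Linear(32, 1),                 # single logit: wake word or not
            )

        def forward(self, x):
            return self.net(x)

    model = WakeWordNet()
    chunk = torch.randn(1, 1, 8000)               # one 500ms audio chunk
    prob = torch.sigmoid(model(chunk))            # probability the wake word was spoken

In practice these usually run on spectrogram/MFCC features rather than raw samples, but the point stands: it's a tiny binary classifier, not a full ASR model.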

There are some open source libraries that make this relatively easy:

- https://github.com/Kitt-AI/snowboy (looks to be shut down now)

- https://github.com/cmusphinx/pocketsphinx

This avoids having to stream audio 24x7 to a cloud model, which would be super expensive. That being said, I'm pretty sure what Alexa does, for example, is send any positive wake word detection to a bigger, more accurate cloud model to verify the prediction of the local wake word detection model.

Once you're confident you have a positive wake word detected - that's when you start streaming to an accurate cloud-based transcription model like Assembly, to minimize costs!


The search term you're looking for is "Keyword Spotting" (or "Wake Word Detection") - and that's what's implemented locally for ~embedded devices that sit and wait for something relevant to come along so that they know when to start sending data up to the mothership (or even turn on additional higher-power cores locally).

Here's an example repo that might be interesting (from initial impressions, though there are many more out there) : https://github.com/vineeths96/Spoken-Keyword-Spotting


Bose used to have some pre-internet system that recognized the song you liked to play right after another song (like in a random shuffle) and attempted to learn what you liked to hear, then queue up the song you were likely to skip to anyway. No idea how they pulled it off, since this must have been on hardware from 15 years ago IIRC.


Ah yes, Bose uMusic. According to the manual, it extracts 30 feature points from the songs to define your preferences.

uMusic patent: https://patents.google.com/patent/CN1637743A/en

Further reading: http://products.bose.com/pdf/customer_service/owners/uMusic_...


This is actually a much simpler task than ASR, and you can easily train it on a normal CPU.

The best do-it-yourself instructions are in a book called TinyML.

Compared to super-deep transformers, you'll find that deployed WW detectors are as simple as SVMs or 2-layer NNs.


> that still adds up to $2,451,526.58 to run 1,024 A100 GPUs for 34 days

Salary costs are probably even higher than compute costs. Automatic speech recognition is an industrial-scale application; it costs a lot to train, but so do many other projects in different fields. How expensive is a plane or a ship? How much can a single building cost? A rocket launch?


In what way are salary costs higher? This is on the order of 10 of their people’s annual salaries. This is for a single training run (meaning overall compute costs are higher), and it isn’t the only thing those ten or so people would have done that year (also meaning overall compute costs are higher).


You have to work for a long time with a whole team on such a model. It adds up. In my experience there is a lot of work - data pipeline, labeling and data quality, the training, evaluations, deployment and making the deployed model efficient. And then there is also bias analysis, collecting failure cases and iterating on the data engine, and trivial things like measuring usage and billing.


Yeah, but a cluster running the resulting model can transcribe thousands of hours of speech per second, 24/7, with a fixed accuracy - what can 10 humans do?


Huh. You and I have two very different readings on this. I'm talking about the ML researcher's time (what if they hired more people instead) and you're talking about human text processors (what if people did this work by hand instead). Kinda neat that we had such different readings.


This is an entire seed round's worth of money on an operational expenditure.


> Salary costs are probably even higher than compute costs.

Yes exactly. Managing that much compute requires many humans!


I wouldn't be so sure :-)


in my experience it's often more like "just use linear regression and tell everyone you're using AI"


That's for structured data; for unstructured data it's more like "create a NN and stack more layers until you have your MVP".


> "create a NN and stack more layers until you have your MVP"

I mean, that's a pretty good principled approach to a lot of ML problems.


I think you have a different definition of "principled" from most people.


I'm very curious as to what part of that process is not explained by the principles by which we understand neural networks to work.

I invite the possibility I've gone this long misunderstanding the definition of "principled" in this context.


To me, taking a "principled approach" means you understand and can justify the eventual outcome of the approach, or can at least guarantee that the outcome satisfies some constraints. How would you justify the number of channels in each layer of a convolutional network? The number of self-attention heads in a transformer? The depth? Can you certify its prediction performance?

Yes, the "just add more layers" approach typically works (in a very narrow sense of the word "works"), but we don't really understand why. We likewise don't understand the failure modes of the system, and cannot engineer around them. Thus it's not really principled in my view.


Only because currently ML is more alchemy than engineering. We mix stuff until we make gold while we can't explain why more parameters generalize better instead of overfitting.


No, it's "load a pretrained ResNet and fine-tune on a few examples". Nobody trains from scratch today except researchers with large budgets.


It's more like trying different off-the-shelf models on some sample of data until the performance is somewhat acceptable.

Unless you're Google, who even trains models from scratch these days? At most you do some fine-tuning.


Lol, very true haha. In actuality, I don't think most NNs are any more 'AI' than simpler models. The definition of AI is fleeting, though.


The serious tip here is to go with gradient boosting, which very often works so well that it hardly makes a difference.


Does anyone use this? How does AssemblyAI compare to Google’s? We are considering adding speech recognition to a small part of our product.


Dylan from Assembly here. Most of our customers have actually switched over to us from Google - this Launch HN from a YC startup that uses our API goes into a bit more detail if you're interested:

https://news.ycombinator.com/item?id=26251322

My email is in my profile if you want to reach out to chat more!


We run tens of thousands of hours of audio through AssemblyAI each day. We did a boatload of benchmarking on manually transcribed audio when we decided to use them, and they were by far the best across the usual suspects (Amazon, etc.) and against smaller startups. They've only gotten better in the 2-3 years we've been using them.


I believe most people have already moved to offline engines. There's no need to send the data to some random guys like this Assembly. NeMo Conformer from Nvidia, Robust Wav2Vec from Facebook, Vosk - there are a dozen options. And the cost is $0.01 per hour, not $0.89 per hour like here.

Another advantage is that you can do more custom things - add words to vocabulary, detect speakers with biometric features, detect emotions.


Without talking about accuracy, any comparison is meaningless.


You don't even need to compare accuracy; you can just check the technology. Facebook's model is trained on 256 GPUs, and you can fine-tune it to your domain in a day or two. The release was 2 months ago. There is no way any cloud startup can have something better in production given they have access to just 4 Titan cards.


Also curious, are there any 'independent' performance benchmarks in this space?


This is tricky. The de facto metric to evaluate an ASR model is Word Error Rate (WER). But results can vary widely depending on the pre-processing that's done (or not done) to transcription text before calculating a WER.

For example, if you take the WER of "I live in New York" vs "i live in new york", the WER would be 60%, because you're comparing a capitalized version against an uncapitalized version.

This is why public WER results vary so widely.

We publish our own WER results and normalize the human and automatic transcription text as much as possible to get as close to "true" numbers as possible. But in reality, we see a lot of people comparing ASR services simply by doing diffs of transcripts.
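
For the curious, WER is just word-level edit distance divided by the reference length, which is why text normalization moves the numbers so much. A hand-rolled sketch:

    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # classic dynamic-programming edit distance over words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("I live in New York", "i live in new york"))                  # 0.6
    print(wer("I live in New York".lower(), "i live in new york".lower()))  # 0.0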


I used both Google's speech-to-text APIs and Assembly's APIs, as well as some other ones, to build Twilio Voice phone-calling applications. The out-of-the-box accuracy was way better with Assembly, and it's far easier to quickly customize the language model for higher accuracy in specific domains (for example, programming language keywords). Generally I avoid using Google APIs whenever possible, since they always seem overly complicated to get started with and have incomplete documentation, even when I'm working in Python, which should be one of the better-supported languages.


I would strongly advise against using Google's ML apis.

First, at my company Milk Video, we are huge fans of Assembly AI. The quality, speed and cost of their transcription is galaxies beyond the competition.

Having worked in machine learning focused companies for a few years, I have been researching this exact question. I'm curious how I can better forecast the amount of ML talent I should expect to build into our team (we are a seed stage company), and how much I can confidently outsource to best-in-class.

A lot of the ML services we use now are utilities that we don't want to manage (speech-to-text, video content processing, etc), and also want to see improve. We took a lot of time to decide who we outsource these things to, like working with AssemblyAI, because we were very conscious of the pace of improvement in speech-to-text quality.

When we were comparing products, the most important questions were:

1. How accurate is the speech-to-text API

1.a Word error rate

1.b Time attributed to start/end word

2. How fast does it process our content

3. How much does it cost

AssemblyAI was the only tool that used modern web patterns (i.e. not Google's horrible API or other non-tech companies trying to provide transcript services), which made it easy to integrate with in a short Sunday morning. The API is also surprisingly better than other speech-to-text services, because it's trained for the kind of audio/video content being produced today (instead of old call center data, or perfect audio from studio-grade media).

Google's API forced you to manage your asset hosting in GCP and handle tons of unnecessary configuration around auth/file access/identity, and it's insanely slow/inaccurate. Some other transcription services we used were embarrassingly bad from a developer experience perspective, in that they also required you to actually talk to a person before giving you access.

The reason Assembly is so great is that you can literally make an API request with a media file URL (video or audio), and boom, you get a nice, intuitive JSON-formatted transcript response. You can also add params to get speakers, topic analysis, and personal information detection - it's just a matter of changing the payload in the first API request.
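
Roughly, the flow looks like this (endpoint and field names are from memory, so check the docs rather than trusting this verbatim):

    import requests

    headers = {"authorization": "YOUR_API_KEY"}   # placeholder key

    # kick off a transcription job from a publicly reachable media URL
    resp = requests.post(
        "https://api.assemblyai.com/v2/transcript",
        json={"audio_url": "https://example.com/meeting.mp4"},
        headers=headers,
    )
    transcript_id = resp.json()["id"]

    # poll until the job is done, then read the transcript JSON
    result = requests.get(
        f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
        headers=headers,
    ).json()
    print(result["status"], result.get("text"))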

I'm very passionate about this because I spent so much time fighting previously implemented transcript services, and want to help anyone avoid the pain because Assembly really does it correctly.


How good is their speaker labeling? We've been using the Google API but their diarization has been basically unusable for our application (transcripts of group conversations).


Dylan from Assembly here. If you want to send me one of your audio files (my email is in my profile) I'd be happy to send you back the diarized results from our API.

You can also signup for a free account and test from the dashboard without having to write any code if that's easier.

Other than lots of crosstalk in your group conversations - is there anything else challenging about your audio (e.g., distance from microphones, background noise, etc.)?


We use assemblyai at our YC startup https://pickleai.com for our transcripts and deploy our own sentiment and summary models to help users take more efficient notes on Zoom calls! Super happy with them!


Maybe relevant in context: you can now use Siri's offline transcription inside your apps (for free).


This doesn’t answer the question at all, but huggingface also has some decent ASR models available.


Huggingface ASR models are not really recommended. The simple fact that they don't use a beam-search decoder with an LM makes them much less accurate for practical applications. If you compare them to setups like NeMo + pyctcdecode, they will be ~30% less accurate.

Also, most of the models there are undertrained.


This is an excellent article, which does a good job of detailing several of the factors involved here. But while it does suggest several ways to reduce the cost of training models, I'm left with a huge question at the end.

How much does it ultimately cost to train a model at this size, and is it feasible to do without VS funding (and cloud credits)?


Author here. Thanks for your comments!

In general - this is expensive stuff. Training big, accurate models just requires a lot of compute, and there is a "barrier to entry" wrt costs, even if you're able to get those costs down. I think it's similar to startups not really being able to get into the aerospace industry unless they raise lots of funding (ie, Boom Supersonic).

Practically speaking though, for startups without funding, or access to cloud credits, my advice would be to just train the best model you can, with the compute resources you have available. Try to close your first customer with an "MVP" model. Even if your model is not good enough for most customers - you can close one, get some incremental revenue, and keep iterating.

When we first started (2017), I trained models that were ~1/10 the size of our current models on a few K80s in AWS. These models were much worse compared to our models today, but they helped us make incremental progress to get to where we are now.


One thing to note on the "train with lower precision" point: on newer hardware with TF32 support, you get much of the speedup of FP16 without it being as finicky. It doesn't save memory, but it's still useful. It's automatic in PyTorch; not sure about TensorFlow:

https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-...

This is mostly important because these settings can significantly affect the price/perf evaluation for your specific model & the available hardware.
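
Concretely, the relevant PyTorch knobs look like this (defaults have shifted between releases, so it's worth setting them explicitly rather than assuming):

    import torch

    # allow TF32 on Ampere GPUs for matmuls and cuDNN convolutions
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True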


The hardest part here is horizontal scaling. OpenAI hand-rolled its own MPI+SSH stack (https://openai.com/blog/scaling-kubernetes-to-7500-nodes/).

I wonder what the state of the art is for horizontal scaling here... preferably on Kubernetes.

PyTorch is tricky to integrate (using TorchElastic). You could use Dask or Ray Distributed. TensorFlow has its own mechanism that doesn't play nice with Kubernetes.

How are others doing it?


I founded a company where we train a lot of machine learning models on music. We aren't quite at AssemblyAI's scale yet, but here is how I built my company's first on-premise GPU cluster to get us started:

1. Purchase GPU machines from Lambda Labs. I went with machines with 256 GB of CPU RAM, 24-core AMD Threadrippers, 2 NVIDIA RTX 3090s, and 10gbps Ethernet. You might want to choose even more expensive GPUs.

2. Make sure your electrical circuits have sufficient capacity to run your GPU machines at peak power consumption. I gave each machine its own US residential electrical circuit. If you are storing your GPU servers in a data center, look into whether they can get you enough electrical power for Lambda Labs's 8-GPU machines. When talking with a data center's sales team, make sure they understand how much electrical power you need. They might charge you a lot of money if you ask for much more electrical power than they usually install in a cabinet. Try to negotiate with multiple data centers to see who can give you the best offer.

3. Purchase storage machines from 45Drives. I recommend buying their 30-drive machines and setting up a ZFS pool of 10 3-drive mirrors. Do not bother with raidz because your read and write speeds will be too slow, bottlenecking your ETL and training jobs.

4. Serve files from your storage machines to your GPU machines using NFS. I like to use MergerFS to merge mounts from different NFS servers. Alternatively, you might want to use Ceph, Min.io, or Lustre.

5. Buy Intel NUCs to run miscellaneous services--like monitoring--that you wouldn't want to colocate with your storage or GPU machines. They are small, cheap, and don't require a lot of electrical power. I bought a couple of NUCs with 64 GB of RAM and a 1 TB NVMe SSD each. Then I purchased external 10gbps Ethernet cards to plug into each NUC's 40gbps Thunderbolt 3 port.

6. Buy 10gbps network switches. MikroTik has affordable 4-port, 8-port, and 16-port 10gbps switches. These are SFP+ (optical) switches, so you may need to buy copper adapters. I really like MikroTik's balance of quality and affordability, so I also buy network routers and other equipment from MikroTik.

7. If possible, try to train models small enough that each model only needs one machine to train. For this reason, maybe you will want to buy one 10-GPU machine instead of 5 2-GPU machines. There are Amdahl's Law-style coordination costs to using multiple machines to train the same model. When I do large hyperparameter searches over many candidate models, I minimize these coordination costs and maximize throughput by limiting each model to only one machine. Of course, this is impossible if you are like AssemblyAI and need 48 V100s to train a model.

8. If you do need to train a single model using multiple machines, I've heard good things about Horovod, but I'm also excited about Ray.io--which offers user-friendly distributed training wrappers around TensorFlow MultiWorkerMirroredStrategy, PyTorch's DistributedDataParallel, or Horovod (which itself can train TensorFlow, PyTorch, or MXNet).
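
If it helps anyone, the Horovod + PyTorch pattern is roughly this (a trimmed-down sketch of the standard examples, with a toy model standing in for a real one):

    import horovod.torch as hvd
    import torch
    import torch.nn as nn

    hvd.init()                                   # one process per GPU
    torch.cuda.set_device(hvd.local_rank())

    model = nn.Linear(512, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # allreduce gradients across workers and start everyone from the same weights
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    for _ in range(100):
        x = torch.randn(32, 512, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")
        optimizer.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        optimizer.step()

You'd launch it with something like "horovodrun -np 16 -H host1:8,host2:8 python train.py" so each GPU gets its own process.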


Aren't preemptible/spot instances a way of dramatically reducing the public cloud cost, if the training jobs are designed to be resumable/resilient to interruptions? Most providers also offer GPUs with this pricing model.


Tangent: I would dearly love to read a similar article focusing on practical advice for industrial applications of statistical modelling, probabilistic programming & Bayesian inference.


Step 1: Sample

Step 2: Run model

...lots of time passes...

Step 3: Sample from your sample

Step 4: GOTO 2

I'm sort of joking, but also not. Specifically, Bayesian inference takes forever, and there's no really good way to speed it up (GPUs don't work as well, because the sampling is sequential).


Aha, there's some interesting stuff in the "productization of Stan" talks from StanCon 2018:

https://www.youtube.com/watch?v=4vfilYZ-F3A


500 million parameters seems like a lot; are there no duplications or redundancies that could reduce the parameter count? One could also use batches of data. Seems very expensive!


TL;DR of the top two points: "get accepted to YC and use cloud credits" and "use dedicated servers from Cirrascale".

Saved you a click.


Use a small network, train it on a local GPU.



