
I was thinking the same thing -- it almost looks a little too good to be true, although it does kinda make sense given the focus on GPU-based clusters. I wonder how this compares to Baidu's warp-ctc [1]. They don't really seem to be the same thing, and maybe I'm missing something since I'm just starting to get into ML, but warp-ctc seems conspicuously absent from this writeup.

[1] https://github.com/baidu-research/warp-ctc



1-bit SGD and insanely large minibatch sizes (8192), it would appear, which drastically reduce communication costs and make data-parallel computation scale (rough sketch of the idea below).

If so, while very cool, that's not a general solution. Scaling at batch sizes of 256 or lower would be the breakthrough. I suspect they get away with this because speech recognition has very sparse output targets (words/phonemes).

Too bad the code from the paper below isn't open-source, because they got g2 instances with ~2.5 Gb/s interconnect to scale:

http://www.nikkostrom.com/publications/interspeech2015/strom...
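
For anyone curious, the 1-bit trick as I understand it: each worker quantizes its gradient to one bit per element (sign plus a reconstruction scale) before exchanging it, and carries the quantization error forward into the next minibatch so nothing is permanently lost. A minimal numpy sketch, not the actual CNTK code -- the single per-tensor scale is a simplification (the real implementation reconstructs at finer granularity):

    import numpy as np

    def one_bit_sgd_step(grad, residual):
        """Quantize a gradient to 1 bit per element before communication,
        feeding the quantization error back into the next minibatch."""
        corrected = grad + residual               # re-inject error left over from the last step
        scale = np.mean(np.abs(corrected))        # one reconstruction value (simplification)
        quantized = scale * np.sign(corrected)    # what gets exchanged: 1 bit/element + 1 float
        new_residual = corrected - quantized      # error feedback for the next step
        return quantized, new_residual

    # Each worker keeps its own residual; only the quantized tensor goes over the wire.
    residual = np.zeros(1000)
    grad = np.random.randn(1000)
    quantized, residual = one_bit_sgd_step(grad, residual)

Since a 32-bit float per element becomes roughly one bit, the gradient exchange shrinks by about 32x, which is why the interconnect stops being the bottleneck.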


Yep. To elaborate: really big batch sizes can increase training-data throughput, but usually mean that less is learned from each example seen, so time to convergence doesn't necessarily improve (it might even get worse if you take things too far).

Training data throughput isn't the right metric to compare -- look at time to convergence, or e.g. time to some target accuracy level on held-out data.
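
Concretely, the comparison I'd want to see is something like this, where train_step and evaluate are hypothetical stand-ins for whatever the framework gives you:

    import time

    def time_to_target(train_step, evaluate, target_acc, max_steps=100_000):
        """Train until held-out accuracy reaches target_acc.
        train_step() runs one minibatch and returns the number of examples processed;
        evaluate() returns accuracy on held-out data. Both are hypothetical."""
        start = time.time()
        examples_seen = 0
        for step in range(1, max_steps + 1):
            examples_seen += train_step()
            if step % 100 == 0 and evaluate() >= target_acc:
                return time.time() - start, examples_seen
        return float("inf"), examples_seen   # never hit the target within the budget

Raw throughput (examples/sec) can keep climbing as you grow the batch while that wall-clock number gets worse.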


Warp-CTC implements one specific model (or at least, one specific loss function); it's not really a general framework in the same way as the other libraries mentioned.
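
To make the distinction concrete: CTC is just a loss you bolt onto a sequence model, and warp-ctc is a fast CPU/GPU implementation of that one loss. A minimal sketch of the interface, using PyTorch's built-in nn.CTCLoss as a stand-in rather than warp-ctc's own bindings:

    import torch
    import torch.nn as nn

    # T time steps, N batch size, C classes (including the blank), S target length.
    T, N, C, S = 50, 4, 20, 10
    log_probs = torch.randn(T, N, C).log_softmax(2)        # per-frame class log-probabilities
    targets = torch.randint(1, C, (N, S))                  # label sequences; index 0 is the blank
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    print(loss.item())

That's the entire surface area: a loss over frame-level class probabilities and unaligned label sequences, not a training framework.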



