
I was thinking the same thing -- it almost looks a little too good to be true, although it does kinda make sense given the focus on GPU-based clusters. I wonder how this compares to Baidu's warp-ctc [1]. They don't really seem to be the same thing, and maybe I'm missing something since I'm just starting to get into ML, but warp-ctc seems conspicuously absent from this writeup.

[1] https://github.com/baidu-research/warp-ctc



1-bit SGD and insanely large minibatch sizes (8192), it would appear, which drastically reduce communication costs and make data-parallel computation scale (rough sketch of the idea below).

If so, while very cool, that's not a general solution. Scaling at batch sizes of 256 or lower would be the breakthrough. I suspect they get away with this because speech recognition has very sparse output targets (words/phonemes).

Too bad the code from the paper below isn't open-source, because they got g2 instances with ~2.5 Gb/s interconnect to scale:

http://www.nikkostrom.com/publications/interspeech2015/strom...
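
For anyone curious, the 1-bit trick as I understand it: each worker quantizes its gradient to one bit per element (sign plus a reconstruction scale) before exchanging it, and carries the quantization error forward into the next minibatch so nothing is permanently lost. A minimal numpy sketch, not the actual CNTK code -- the single per-tensor scale is a simplification (the real implementation reconstructs at finer granularity):

    import numpy as np

    def one_bit_sgd_step(grad, residual):
        """Quantize a gradient to 1 bit per element before communication,
        feeding the quantization error back into the next minibatch."""
        corrected = grad + residual               # re-inject error left over from the last step
        scale = np.mean(np.abs(corrected))        # one reconstruction value (simplification)
        quantized = scale * np.sign(corrected)    # what gets exchanged: 1 bit/element + 1 float
        new_residual = corrected - quantized      # error feedback for the next step
        return quantized, new_residual

    # Each worker keeps its own residual; only the quantized tensor goes over the wire.
    residual = np.zeros(1000)
    grad = np.random.randn(1000)
    quantized, residual = one_bit_sgd_step(grad, residual)

Since a 32-bit float per element becomes roughly one bit, the gradient exchange shrinks by about 32x, which is why the interconnect stops being the bottleneck.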


Yep. To elaborate: really big batch sizes can increase training-data throughput, but usually mean that less is learned from each example seen, so time to convergence doesn't necessarily improve (it might even get worse if you take things too far).

Training data throughput isn't the right metric to compare -- look at time to convergence, or e.g. time to some target accuracy level on held-out data.
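
Concretely, the comparison I'd want to see is something like this, where train_step and evaluate are hypothetical stand-ins for whatever the framework gives you:

    import time

    def time_to_target(train_step, evaluate, target_acc, max_steps=100_000):
        """Train until held-out accuracy reaches target_acc.
        train_step() runs one minibatch and returns the number of examples processed;
        evaluate() returns accuracy on held-out data. Both are hypothetical."""
        start = time.time()
        examples_seen = 0
        for step in range(1, max_steps + 1):
            examples_seen += train_step()
            if step % 100 == 0 and evaluate() >= target_acc:
                return time.time() - start, examples_seen
        return float("inf"), examples_seen   # never hit the target within the budget

Raw throughput (examples/sec) can keep climbing as you grow the batch while that wall-clock number gets worse.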


Warp-CTC implements one specific model (or at least, one specific loss function); it's not really a general framework in the same way as the other libraries mentioned.
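
To make the distinction concrete: CTC is just a loss you bolt onto a sequence model, and warp-ctc is a fast CPU/GPU implementation of that one loss. A minimal sketch of the interface, using PyTorch's built-in nn.CTCLoss as a stand-in rather than warp-ctc's own bindings:

    import torch
    import torch.nn as nn

    # T time steps, N batch size, C classes (including the blank), S target length.
    T, N, C, S = 50, 4, 20, 10
    log_probs = torch.randn(T, N, C).log_softmax(2)        # per-frame class log-probabilities
    targets = torch.randint(1, C, (N, S))                  # label sequences; index 0 is the blank
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    print(loss.item())

That's the entire surface area: a loss over frame-level class probabilities and unaligned label sequences, not a training framework.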



