1-bit SGD and insanely high minibatch sizes (8192) it would appear: which drasti...

mjw · on Jan 25, 2016

Yep. To elaborate: really big batch sizes can speed up training data throughput, but usually mean that less is learned from each example seen, so time-to-convergence might not necessarily improve (might even increase, if you take things too far).

Training data throughput isn't the right metric to compare -- look at time to convergence, or e.g. time to some target accuracy level on held-out data.