1. The point of the small subset is to test that your training pipeline works end to end and can overfit it to near-zero loss (rough sketch after this list).
2. Sure, but for most new tricks like mixup, RandAugment, etc., the results usually transfer. The problem with deep learning is that many new results don't replicate, so it's good to have a way to validate things quickly.
3. The lower-level features are usually fairly data-agnostic and transfer well to new tasks.
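
To make point 1 concrete, here's a minimal PyTorch sketch. The tiny tensor dataset and MLP are just stand-ins for whatever you actually train; the idea is that if your pipeline can't drive the loss on ~100 examples toward zero, something is broken.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: 100 random 32-dim inputs with 10 classes. Swap in your real
# dataset/model/loss; keep the "overfit a tiny subset" check the same.
data = TensorDataset(torch.randn(100, 32), torch.randint(0, 10, (100,)))
model = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()

loader = DataLoader(data, batch_size=32, shuffle=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(300):                 # plenty of passes to memorize 100 examples
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    if epoch % 50 == 0:
        print(epoch, round(loss.item(), 4))   # should head toward ~0
```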
Would a random sample be representative? Statistically, that should hold for any large N; in fact, it's not clear to me that any other sampling scheme would be more representative.
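
Quick way to convince yourself (the labels below are made up, but the same check works on any real labelled dataset): class proportions of a 5% random sample closely track the full set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in: 100k examples over 10 imbalanced classes.
labels = rng.choice(10, size=100_000, p=np.arange(1, 11) / 55)

sample = rng.choice(labels, size=5_000, replace=False)   # 5% random sample

full_dist = np.bincount(labels, minlength=10) / labels.size
samp_dist = np.bincount(sample, minlength=10) / sample.size
print(np.abs(full_dist - samp_dist).max())   # typically on the order of 0.01
```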
Other tips not mentioned in the article:
1. Tune your hyperparameters on a subset of the data (first sketch below).
2. Validate new methods with smaller models on public datasets.
3. Fine-tune models instead of training from scratch, either public pretrained models or ones you've trained before (second sketch below).
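
Rough sketch of tip 1, again with a toy dataset and model as placeholders for your own: rank a few learning rates on ~10% of the training data, then reuse the winner for the full run.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset

# Placeholder data/model; the subset-search idea is the point, not the architecture.
full_train = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))
val = TensorDataset(torch.randn(1_000, 32), torch.randint(0, 10, (1_000,)))

def build_model():
    return nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))

# Search on ~10% of the training data.
idx = torch.randperm(len(full_train))[:1_000].tolist()
loader = DataLoader(Subset(full_train, idx), batch_size=64, shuffle=True)
loss_fn = nn.CrossEntropyLoss()

best_lr, best_loss = None, float("inf")
for lr in (1e-2, 1e-3, 1e-4):
    model = build_model()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(3):                       # a few cheap epochs on the subset
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    with torch.no_grad():
        vx, vy = val.tensors
        vloss = loss_fn(model(vx), vy).item()
    if vloss < best_loss:
        best_lr, best_loss = lr, vloss
print("best lr on subset:", best_lr)         # reuse for the full run
```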
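
And a minimal sketch of tip 3 using torchvision (the class count and dataloader are placeholders to fill in): load ImageNet weights, freeze the backbone, and train only a fresh classification head.

```python
import torch
from torch import nn
from torchvision import models

num_classes = 10                                     # placeholder for your task
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for p in model.parameters():                         # freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)   # fresh head, trainable

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
# for x, y in your_dataloader:                       # plug in your own data here
#     opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
```

Freezing the backbone and only training the head is the cheapest variant; unfreezing later layers with a small learning rate usually helps once the head has converged.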