1. The point of the small subset is to test that your training pipeline works end to end and can overfit it to near-zero loss (rough sketch after this list).
2. Sure, but for most new tricks like mixup, RandAugment, etc., the results usually transfer. The problem with deep learning is that many new results don't replicate, so it's good to have a way to validate things quickly.
3. The lower-level features are usually fairly data-agnostic and transfer well to new tasks.
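
To make point 1 concrete, here's a minimal PyTorch sketch. The tiny tensor dataset and MLP are just stand-ins for whatever you actually train; the idea is that if your pipeline can't drive the loss on ~100 examples toward zero, something is broken.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: 100 random 32-dim inputs with 10 classes. Swap in your real
# dataset/model/loss; keep the "overfit a tiny subset" check the same.
data = TensorDataset(torch.randn(100, 32), torch.randint(0, 10, (100,)))
model = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()

loader = DataLoader(data, batch_size=32, shuffle=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(300):                 # plenty of passes to memorize 100 examples
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    if epoch % 50 == 0:
        print(epoch, round(loss.item(), 4))   # should head toward ~0
```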
Would a random sample be representative? Statistically, that should hold for any large N; in fact, it's not clear to me that any other sampling scheme would be more representative.
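
Quick way to convince yourself (the labels below are made up, but the same check works on any real labelled dataset): class proportions of a 5% random sample closely track the full set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in: 100k examples over 10 imbalanced classes.
labels = rng.choice(10, size=100_000, p=np.arange(1, 11) / 55)

sample = rng.choice(labels, size=5_000, replace=False)   # 5% random sample

full_dist = np.bincount(labels, minlength=10) / labels.size
samp_dist = np.bincount(sample, minlength=10) / sample.size
print(np.abs(full_dist - samp_dist).max())   # typically on the order of 0.01
```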
Other tips not mentioned in the article:
1. Tune your hyperparameters on a subset of the data (first sketch below).
2. Validate new methods with smaller models on public datasets.
3. Fine-tune models instead of training from scratch, either public pretrained models or ones you've trained before (second sketch below).
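
Rough sketch of tip 1, again with a toy dataset and model as placeholders for your own: rank a few learning rates on ~10% of the training data, then reuse the winner for the full run.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset

# Placeholder data/model; the subset-search idea is the point, not the architecture.
full_train = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))
val = TensorDataset(torch.randn(1_000, 32), torch.randint(0, 10, (1_000,)))

def build_model():
    return nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))

# Search on ~10% of the training data.
idx = torch.randperm(len(full_train))[:1_000].tolist()
loader = DataLoader(Subset(full_train, idx), batch_size=64, shuffle=True)
loss_fn = nn.CrossEntropyLoss()

best_lr, best_loss = None, float("inf")
for lr in (1e-2, 1e-3, 1e-4):
    model = build_model()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(3):                       # a few cheap epochs on the subset
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    with torch.no_grad():
        vx, vy = val.tensors
        vloss = loss_fn(model(vx), vy).item()
    if vloss < best_loss:
        best_lr, best_loss = lr, vloss
print("best lr on subset:", best_lr)         # reuse for the full run
```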
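
And a minimal sketch of tip 3 using torchvision (the class count and dataloader are placeholders to fill in): load ImageNet weights, freeze the backbone, and train only a fresh classification head.

```python
import torch
from torch import nn
from torchvision import models

num_classes = 10                                     # placeholder for your task
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for p in model.parameters():                         # freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)   # fresh head, trainable

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
# for x, y in your_dataloader:                       # plug in your own data here
#     opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
```

Freezing the backbone and only training the head is the cheapest variant; unfreezing later layers with a small learning rate usually helps once the head has converged.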