So deep learning basically does the opposite of what classic optimization theory prescribes about staying in the stable region. The model literally learns by becoming unstable, oscillating, and then using that energy to self-correct.
The chaos is the point. What a crazy, beautiful mess.
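A toy sketch of the classical "stable region" being described (numbers are illustrative, not from the thread): for gradient descent on a quadratic f(x) = 0.5·a·x², the update is x ← (1 − lr·a)·x, so iterates shrink only when lr < 2/a. Past that edge, a pure quadratic just oscillates with growing amplitude; in a real network the landscape is nonlinear, which is what lets the oscillation get reined back in.

```python
def gd(a, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * a * x**2 (gradient a*x)."""
    x = x0
    for _ in range(steps):
        x -= lr * a * x  # equivalent to x *= (1 - lr * a)
    return x

a = 10.0                     # curvature ("sharpness"); threshold is 2/a = 0.2
print(abs(gd(a, lr=0.15)))   # below the edge: shrinks toward 0
print(abs(gd(a, lr=0.21)))   # past the edge: flips sign each step, amplitude grows
```

Classical theory says pick the small learning rate; the edge-of-stability observation is that trained networks sit near (or past) the 2/a threshold of their loss curvature and survive anyway.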
Researchers are constantly trying to train more expressive models faster. Any method that can both converge and take large jumps will be chosen, so you're more or less guaranteed to end up in a regime where the sharpness is high but somehow stays under control. If we weren't there, we'd try new architectures until we arrived there. So deep learning doesn't "do this"; we do it by any method possible, and it happened to be the various architectures that currently fall under "deep learning". Keep in mind that many architectures which are deep do not converge: you're seeing survivorship bias.
Reminds me of Simulated Annealing. Some randomness has always been part of optimization processes that seek a better-than-local equilibrium: Genetic Algorithms have mutation, Simulated Annealing has temperature, and Gradient Descent similarly has random batches.
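A minimal simulated-annealing sketch to make the temperature analogy concrete (function and parameter names here are my own, not from the thread): the temperature controls how often an uphill move is accepted, so early on the search can jump out of local minima, and as it cools it settles into whichever basin it found.

```python
import math
import random

def anneal(f, x0, step=0.5, t0=2.0, cooling=0.995, iters=5000, seed=0):
    """Minimize f from x0 with simulated annealing; returns (best_x, best_f)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        fc = f(cand)
        # Metropolis criterion: always accept improvements; accept a
        # worse move with probability exp(-(fc - fx) / t).
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling  # cool down: less randomness as the run progresses
    return best_x, best_f

# A bumpy 1-D objective: a quadratic bowl plus oscillations, so plain
# greedy descent from x0 = 6 would get stuck in a nearby local minimum.
bumpy = lambda x: x * x + 3.0 * math.sin(5.0 * x)
x, fx = anneal(bumpy, x0=6.0)
```

The SGD analogy in the comment is that minibatch noise plays a role similar to the uphill-accepting randomness here, except the "temperature" isn't explicitly scheduled.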