Okay, I just tried this on my pet transformer training benchmark and the results are very disappointing; it converges much more slowly than just using RMSNorm.
It either needs significant hyperparameter tuning (besides tweaking alpha, which doesn't seem to do much for me), or fancier initialization (I tried both the PyTorch default and orthogonal init; no difference), or maybe my scalar optimizer doesn't work well on it (I have a custom optimizer for scalars that speeds up convergence vs Adam, but on DyT layers it's no better than Adam), or maybe it only catches up after billions of tokens (which I don't have the budget to test).
Slight update: fancier initialization of the DyT weights (instead of initializing them to ones) seems to help a lot in my case, although it's still not as good as just using RMSNorm. Do something like this on the very first training step (`x` is the input to the layer):
```python
y = x.to(torch.float32)
y = y * torch.rsqrt(y.pow(2).mean(-1, keepdim=True) + 1e-6)  # what RMSNorm would output for this batch
z = torch.tanh(self.alpha * x)                                # what DyT outputs before the per-channel weight
scale = (y / (z + 1e-6)).mean(dim=-2).flatten()               # per-channel ratio, averaged over the token dimension
self.weight.detach().copy_(scale)                             # overwrite the all-ones init in place
```
This initializes the weights so that DyT's output on the first batch is close to what RMSNorm would have produced, and it seems to help.
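For context, here is a minimal sketch of how this data-dependent init could be wired into a DyT layer. The `DyT` class, the `alpha_init` value, and the assumption that the last dimension of `x` is the channel dimension are mine, not from the snippet above:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    # Dynamic Tanh as a drop-in norm replacement: weight * tanh(alpha * x) + bias.
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self._pending_init = True  # run the data-dependent init once, on the first training batch

    @torch.no_grad()
    def _init_weight_from_batch(self, x: torch.Tensor) -> None:
        xf = x.to(torch.float32)
        rms = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + 1e-6)  # RMSNorm reference output
        dyt = torch.tanh(self.alpha * xf)                                # DyT output before the per-channel weight
        # Per-channel ratio, averaged over every token in the batch.
        scale = (rms / (dyt + 1e-6)).reshape(-1, xf.shape[-1]).mean(dim=0)
        self.weight.copy_(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self._pending_init and self.training:
            self._init_weight_from_batch(x)
            self._pending_init = False
        return self.weight * torch.tanh(self.alpha * x) + self.bias
```

One small deviation: averaging the ratio over all leading dimensions (instead of `mean(dim=-2).flatten()`) keeps the result shaped like `self.weight` whether or not `x` carries a batch dimension.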
It's a fully custom architecture, heavily inspired by the modded-nanogpt speedrun (https://github.com/KellerJordan/modded-nanogpt) but written fully from scratch and further tweaked/modified. I use it for experiments and as a testbed when developing my training harness (which I use for training other models too, and which receives all of my non-LLM-specific improvements like e.g. better than Adam optimizers, a custom GPU memory allocator, custom gradient accumulation that accumulates directly into the optimizers' state without using extra VRAM for gradient, etc.).
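The harness details above are only described, not shown. Purely to illustrate the "accumulate straight into optimizer state" idea, here is one way it could look with PyTorch's `register_post_accumulate_grad_hook` (available in recent PyTorch versions); the `attach_grad_accumulation` helper and the buffer layout are my own assumptions, not the author's code:

```python
import torch

def attach_grad_accumulation(model: torch.nn.Module, accum_state: dict) -> None:
    # After each micro-batch's backward pass, fold the fresh gradient into a buffer
    # the optimizer owns and drop p.grad, so no separate accumulated-gradient tensor
    # has to sit in VRAM between micro-batches.
    def hook(p: torch.Tensor) -> None:
        buf = accum_state.setdefault(p, torch.zeros_like(p))
        buf.add_(p.grad)   # accumulate directly into optimizer-owned state
        p.grad = None      # free the per-micro-batch gradient immediately
    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)
```

The optimizer step would then read (and zero) `accum_state[p]` instead of `p.grad`.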