Hard to be sure because the source of that information isn't known, but generally when people talk about training costs like this they include more than just the electricity but exclude staffing costs.
Other reported training costs tend to include rental of the cloud hardware (or equivalent if the hardware is owned by the company), e.g. NVIDIA H100s are sometimes priced out in cost-per-hour.
Citation needed on "generally when people talk about training costs like this they include more than just the electricity but exclude staffing costs".
It would be simply wrong to exclude the staffing costs. When each engineer costs well over 1 million USD in total costs year over year, you sure as hell account for them.
If you have 1,000 researchers working for your company and you constantly have dozens of different training runs in the go, overlapping each other, how would you split those salaries between those different runs?
Calculating the cost in terms of GPU-hours is a whole lot easier from an accounting perspective.
The papers I've seen that talk about training cost all do it in terms of GPU hours. The gpt-oss model card said 2.1 million H100-hours for gpt-oss:120b. The Llama 2 paper said 3.31M GPU-hours on A100-80G. They rarely give actual dollar costs and I've never seen any of them include staffing hours.
No, they don't! That's why the "5.5 million" deepseek V3 number as read by American investors was total bullshit (because investors ignored their astrik saying "only final training run")
Yeah, that's one of the most frustrating things about these published numbers. Nobody ever wants to share how much money they spent on runs that didn't produce a useful model.
As with staffing costs though it's hard to account for these against individual models. If Anthropic run a bunch of training experiments that help them discover a new training optimization, then use that optimization as part of the runs for the next Opus and Sonnet and Haiku (and every subsequent model for the lifetime of the company) how should the cost of that experimental run be divvied up?
No, because what people are generally trying to express with numbers like these, is how much compute went into training. Perhaps another measure, like zettaflop or something would have made more sense.