Benchmarks optimize for fundraising, not users. The gap between "state of the art" and "previous gen" keeps shrinking in real-world use, but investors still write checks based on decimal points in test scores.
We try to make benchmarks for users, but it's like that 20% article: different people want a different 20%, and you just end up adding "features" and playing whack-a-mole with the different kinds of 20%.
If a single benchmark could be a universal truth, and it were easy to figure out how to build it, everyone would love that. It isn't easy, though, and that's why we're in the state we're in right now.
The problem isn’t with the benchmarks (or the models, for that matter); it’s that they’re being used to prop up the indefensible product marketing claims made by people frantically justifying their requests for more dump trucks of thousand-dollar bills to replace the ones they just burned through in a few months.
Absolutely not. This is not a problem with any part of the engineering process. Nearly everything wrong with the AI business can be laid at the feet of product managers, marketing, the C-suite crowd, etc.