As part of the Holistic Agent Leaderboard (HAL) initiative at Princeton CITP, we evaluated more than 220 agent runs — the equivalent of over 20,000 agent rollouts — across 9 models and 9 benchmarks, for a total cost of $40,000. The benchmarks are: AssistantBench, CORE-Bench Hard, GAIA, Online Mind2Web, Scicode, ScienceAgentBench, SWE-bench Verified Mini, TAU-bench Airline, and USACO.
In that process, we “burned” 2.6 billion prompt tokens and learned a lot along the way. In this article, I’d like to share some of the insights we gained, with a particular focus on the GAIA benchmark.
By that definition, the ChatGPT app is now an AI agent. When you use ChatGPT nowadays, you can select different models and complement them with tools like web search and image creation. It's no longer a simple text-in / text-out interface. It looks like it still is, but deep down it is something new: it is agentic…
https://medium.com/thoughts-on-machine-learning/building-ai-...
Exactly. I think the study is a good reminder that we really have to be careful about the productivity gains attributed to AI. My main takeaway, imo, despite the study's limitations, is that AI is not a panacea: it can increase productivity, but only if it's used well, with good workflows in place, and in the right context.
Klarna, I feel, also used the "700 fired due to AI" and "oops, now we're rehiring some" story as a nice distraction from the ~2,100-person total reduction that occurred from 2022 to 2024.