Hacker News | fsndz's comments

We can't automate it anyway, and vibe coding is overrated: https://medium.com/thoughts-on-machine-learning/vibe-coding-...

I did that, burned 2.6B tokens in the process and learned a lot: https://transitions.substack.com/p/what-burning-26-billion-p...

As part of the Holistic Agent Leaderboard (HAL) initiative at Princeton CITP, we evaluated more than 220 agent runs, the equivalent of over 20,000 agent rollouts across 9 models and 9 benchmarks, for a total cost of $40,000. The benchmarks are: AssistantBench, CORE-Bench Hard, GAIA, Online Mind2Web, Scicode, ScienceAgentBench, SWE-bench Verified Mini, TAU-bench Airline, and USACO.

In that process, we “burned” 2.6 billion prompt tokens and learned a lot along the way. In this article, I’d like to share some of the insights we gained, with a particular focus on the GAIA benchmark.


By that definition, the ChatGPT app is now an AI agent. When you use ChatGPT nowadays, you can select different models and complement these models with tools like web search and image creation. It’s no longer a simple text-in / text-out interface. It looks like it is still that, but deep down, it is something new: it is agentic… https://medium.com/thoughts-on-machine-learning/building-ai-...


And people are still saying vibe coding is overrated? Nonsense: https://www.lycee.ai/blog/why-vibe-coding-is-overrated


Exactly. I think the study is a good reminder that we really have to be careful about the productivity gains attributed to AI. The main takeaway imo, despite the study's limitations, is that AI is not a panacea: it can increase productivity, but only if used 'well', with good workflows in place, and in the right context.


I mean, Hacker News is still the same. Aren't they using AI to make this website more of whatever it was before?


Klarna tried this narrative shopping strategy first and it backfired: https://fsndzomga.medium.com/i-have-no-confidence-in-klarna-...


I am tired of seeing this. First the Klarna dude did it and it backfired. Now Andy... People fail to grasp that building AI agents is no longer enough; you need to do more: https://medium.com/thoughts-on-machine-learning/building-ai-...


I feel Klarna also used the "700 fired due to AI" and "oops, now we're rehiring some" stories as a nice distraction from the ~2,100 total reduction that occurred from 2022 to 2024.


Exactly!


I now understand why some people say MCP is mostly bullshit + a huge security risk: https://www.lycee.ai/blog/why-mcp-is-mostly-bullshit

