Keep in mind that the "news cycle" isn't of much use in this field. For 2025, almost all "mainstream" media takes were dead wrong. Remember the Deepseek r1 craze in feb25? When nvda was dead, oai was dead and so on? Yeah... that went well. Remember all the "no more data" craze? Despite no actual researcher worth their salt saying it or even hinting at it? Remember the "hitting walls" rhetoric?
The media has been "social media'd", with everything being driven by algorithms, everything being about capturing attention at the cost of everything else. Negativity sells. FUD sells.
> Remember all the "no more data" craze? Despite no actual researcher worth their salt saying it or even hinting at it?
We ran out of fresh interesting data. A large chunk of training needs to generate its own now. Synthetic data training became a huge thing over the last year.
> Remember the "hitting walls" rhetoric?
Since then, basic training has slowed down a lot and improvements come more from agentic and thinking solutions, with a lot more reinforcement training than in the past.
The fact we worked around those problems doesn't mean they weren't real. It's like when people say Y2K wasn't a problem... ignoring all the work that went into preventing issues.
No, we didn't. Hassabis has been saying this for a while now, and Gemini3 is proof of that. The data is there, there are still plenty of untapped resources.
> Synthetic data training became a huge thing over the last year.
No, people "heard" about it over the last year. Synthetic data training has been a thing in model training for ~2 years already. L3 was post-trained on synthetic-only data, and was released in apr24. Research only was even earlier with the phi family of models. Again, if you're only reading the mainstream media you won't get an accurate picture of these things, as you'd get from actually working in this field, or even following good sources, read the key papers and so on.
> The fact we worked around those problems doesn't mean they weren't real.
The way the media (and some influencers in this space) have framed it over the last year is not accurate. I get that people don't trust CEOs (and for good reasons), but even Amodei was saying there is no data problem in interviews in early '25.
I parsed "reasonable" as in having reasonable speed to actually use this as intended (in agentic setups). In that case, it's a minimum of 70-100k for hardware (8x 6000 PRO + all the other pieces to make it work). The model comes with native INT4 quant, so ~600GB for the weights alone. An 8x 96GB setup would give you ~160GB for kv caching.
You can of course "run" this on cheaper hardware, but the speeds will not be suitable for actual use (i.e. minutes for a simple prompt, tens of minutes for high context sessions per turn).
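A rough back-of-the-envelope of that budget, as a minimal sketch (the parameter count, quantization overhead and card count are assumptions for illustration, not measurements):

```python
# Back-of-the-envelope VRAM budget for a ~1T-parameter model served at INT4.
# All numbers below are rough assumptions for illustration only.

params = 1.0e12            # ~1T parameters (assumed)
bytes_per_param = 0.5      # INT4 -> 4 bits per weight
overhead = 1.2             # assumed ~20% extra for non-quantized layers, scales, buffers

weight_gb = params * bytes_per_param * overhead / 1e9    # ~600 GB of weights
total_vram_gb = 8 * 96                                   # 8x 96GB cards
kv_headroom_gb = total_vram_gb - weight_gb               # ~160-170 GB left for KV cache

print(f"weights: ~{weight_gb:.0f} GB, total VRAM: {total_vram_gb} GB, "
      f"KV-cache headroom: ~{kv_headroom_gb:.0f} GB")
```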
The unit economics seem tough at that price for a 1T parameter model. Even with MoE sparsity you are still VRAM bound just keeping the weights resident, which is a much higher baseline cost than serving a smaller model like Haiku.
Eh. It's at least debatable. There is a moat in compute (this was openly stated at a recent meeting of AI tech CEOs in China). And a bit of a moat in architecture and know-how (oAI's gpt-oss is still best in class, and if rumours are to be believed, it was mostly trained on synthetic data, a la phi4 but with better data). And there are still moats around data (see the gemini family, especially gemini3).
But if you can conjure up compute, data and a basic arch, you get xAI, which is up there with the other 3 labs in SotA-like performance. So I'd say there are some moats, but they aren't as safe as they were thought to be in 2023, for sure.
It would be trivial to detect such gaming, tho. That's the beauty of the test, and that's why they're probably not doing it. If a model draws "perfect" (whatever that means) pelicans on a bike, you start testing for owls riding a lawnmower, or crows riding a unicycle, or x _verb_ on y ...
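A minimal sketch of that "x _verb_ on y" idea, with hypothetical word lists (none of these are from the actual benchmark): randomly sampled variants would quickly expose a model that was tuned to one fixed prompt.

```python
# Randomized prompt variants to probe for benchmark gaming.
# The word lists are illustrative assumptions, not from any real benchmark.
import random

subjects = ["pelican", "owl", "crow", "heron"]
verbs = ["riding", "balancing on", "pedalling"]
objects = ["a bicycle", "a lawnmower", "a unicycle", "a skateboard"]

def sample_prompt(rng: random.Random) -> str:
    # "x <verb> y" -- a fresh combination every time
    return f"Generate an SVG of a {rng.choice(subjects)} {rng.choice(verbs)} {rng.choice(objects)}"

rng = random.Random(0)
for _ in range(3):
    print(sample_prompt(rng))
```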
Sure, I agree! I didn't mean that I expected better results because LLMs improved significantly in their visual-spatial reasoning, but simply because I expected more people to draw SVGs of pelicans on bikes and more LLMs to ingest them. That's what I find a bit surprising.
> What started as a community experiment is becoming infrastructure. The developers behind Ralph and Taskmaster figured out something real. Now the platforms are catching up.
> That’s usually how it goes. The practitioners find the patterns first. Then the patterns become features.
This is the scariest thing atm with the fast pace of these things. As capabilities increase, everything you've spent time building (w/ scaffolding, tooling, etc) gets "merged" into the all-you-can-prompt solution that the big labs provide. If your previous work has no differentiation, it's very hard to provide additional value / monetise it. And it's hard to know what will be differentiation or what will get eaten up.
It's that sci-fi story trope of the colony ship that gets overtaken by a ship with a newer-generation engine, and when they reach their planet they find a thriving colony there already. But with software :)
Even funnier is ghuntley going around giving talks, posting on x etc saying “if you don’t learn this stuff you’re gonna be left behind”. I think this was around when he made the first Ralph blog post?
Which of course couldn't be true. Any "prompt skill" is going to be commodified. That's the entire premise AI companies are trying to sell.
I am musing with the idea that, other than dipping a toe in, investing a lot of time in this stuff is a waste (for this reason), and a better use of that time is getting deeper human domain experience in XYZ.
Agree, like I said in my other post reviewing what Jason Lemkin was doing with GTM: pick one tool, learn how to use it in real world scenarios, gain experience and you will be hyper employable.
It’s the application that matters imo
100% agree with this, tooling is no longer an advantage.
The only thing I say to myself on this, given how Ralph took months to be noticed (and taskmaster is somehow still flying under the radar, comparatively speaking), is that people are drowning in the volume of new features / tools. It's what you do with these tools that matters at the end of the day, and the focus shifts more to GTM, marketing etc. rather than the uniqueness of the software. Most important thing: find clients :)
Anyway interesting times!
I find it funny that as these systems become better at something (i.e. "basic ass CRUD"), people still maintain that they're only good at those things and nothing else.
> VIBETENSOR is an open-source research system software stack for deep learning, generated by LLM-powered coding agents under high-level human guidance. In this paper, "fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs; validation relied on builds, tests, and differential checks executed by the agent workflow, without per-change manual diff review.
Eh, like everything on the Internet, the anti crowd is becoming more obnoxious than the pro crowd ever was. It has become an identity thing, more than a technical thing, and it always sucks when it devolves into that.
It is still a technical thing though. AI-generated code is outright buggy when it's not mediocre, but the pro-AI crowd is pretending you can guardrail and test-suite your way to good generated code. As if painting a picture in negative space is somehow less work than painting it directly. And that's when you know all the requirements (the picture) upfront.
Huh, I'm the exact opposite. With the exception of Hannah Fry's work at deepmind (where she acts as a charismatic proxy for the more nerdy guests), he is by far the best interviewer on technical stuff (AI stuff mostly, but some early robotics stuff as well). He knows the field, he asks pertinent questions and more importantly he knows when to just let the speaker speak.
Compared to someone like Dwarkesh, it's night and day. There's a fine line between pushing the guest and just interrupting them every 2nd thought to inject your own "takes".
I think, similar to Joe Rogan, that's the main value he provides to listeners. He identifies guests who have some veil of intellectualism and provides them with a platform to speak.
However I don't think that makes for an interesting interviewer. There are no challenging questions, only ones he knows will fit into the narrative of what the guest wants to say. I might as well read a 2-3 hour PR piece issued by the guests.
What you call "platforming" I often call "listening to what someone says/thinks". Not every interview needs challenging questions, or to be a battle/debate, and sometimes it's not appropriate (the George Hotz interview above being an example, a difference in qualifications being another). But I enjoy trying to understand someone, quirks and all, especially the human aspect, flaws and all. It's interesting seeing the differences in people.
From what I've seen, people that crave "challenging questions" usually most enjoy activist interviewers that are very strongly aligned with their own (usually political) worldview. I don't think that describes Lex Fridman, or me as a listener, at all, and that's fine.
No, not every interview. But if an interviewee presents fiction/hatred as fact, the interviewer should have the ability to call that out, or at least caution the listener with an "I don't know about that".
A specific example that comes to mind is Eric Weinstein's appearance on the podcast, and letting him talk about his "long mouse telomere experiment flaws" without questions, even though those claims had at that point been thoroughly debunked.
I find little interesting "human aspect" therein, as it usually boils down to "you are lying (to us/yourself) for your own gain", which isn't novel.
There are podcasts that do a similar long-form format well. A great example is the German format "Alles gesagt?" (~= "Nothing left unsaid?"), where interesting personalities can talk for however long they want, but the interviewers ask interesting/dynamic follow-up questions, and also have the journalistic acumen/integrity to push back on certain topics (without souring the mood).
> letting him talk about his "long mouse telomere experiment flaws" without questions
This requires that the interviewer is as knowledgeable as the interviewee (the qualification problem I mentioned). Unless the questions and answers are known ahead of time, it won't be possible to know everything an interviewee will say. Assuming that's the case, how should he have handled that response? Should he not interview people outside of his own expertise? I think one way would be to ask "is there any disagreement?", but then you're left with the same problem.
I think Lex Fridman not knowing much about the history/current state of rat telomere research is entirely reasonable. I think a requirement of knowing the entire context of a person is not reasonable. I also don't think it's reasonable to believe everything you hear in an interview, from either human. "Charitable interpretation, but verify" is a good way to take in information.
I'm fine with opinionated people who have lived broadly along the socio-economic ladder. At least their opinions are grounded in a richer experience of life, rather than just growing up upper middle class to wealthy and saying you dropped out of college to make a YC-funded startup.
No disrespect to founders who do get there, it's certainly an accomplishment. But I'd rather listen to loud erratic Netflix engineer Dr disrespect.
Keep in mind that most people posting speed benchmarks try them with basically 0 context. Those speeds will not hold at 32/64/128k context length.
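A minimal sketch of why that happens, with assumed model dimensions (layers, heads and head size are illustrative, not from any specific model): the KV cache grows linearly with context length, eating VRAM and memory bandwidth on every decode step.

```python
# Rough KV-cache size as a function of context length.
# Model dimensions below are assumptions for illustration only.

layers = 60            # assumed transformer depth
kv_heads = 8           # assumed key/value heads (GQA)
head_dim = 128         # assumed head dimension
bytes_per_elem = 2     # fp16/bf16 cache

def kv_cache_gb(context_tokens: int) -> float:
    # factor of 2 for keys and values
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context_tokens / 1e9

for ctx in (0, 32_000, 64_000, 128_000):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache per sequence")
```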