Have you actually used LLMs for non-trivial tasks? They are still incredibly bad at genuinely hard engineering work, and they still lie all the time; it's just gotten harder to notice, especially if you're letting one run all night and generate reams of crap.
Most people are optimizing for terrible benchmarks, don't really understand what the model actually did, and just assume it did something good. It's the blind leading the blind, basically, with a lot of people in the grip of AI psychosis or delusion.
I think the OP's comment is entirely fair. Karpathy and others come across to me as people feeding a hose back into itself: they work with LLMs to produce output that is about LLMs.
I might reframe the comment as: are you actually using LLMs for sustained, difficult work in a domain that has nothing to do with LLMs?
It feels like a lot of LLM-oriented work is fake. It compounds "stuff," both inputs and outputs, and the increased volume of stuff makes it feel like we're living on a higher plane of information abundance, when in reality we're just increasing entropy.
Tech has always had an information bias, and LLMs are the perfect vehicle to create a lot of superfluous information.
In my limited experience, using LLMs to code up things unrelated to LLMs (robotics, for instance) is significantly less productive than using them to code up things related to LLMs. It works, just not very well, and it requires a lot more legwork on the user's end than in other areas.
To be fair, Karpathy isn't known for using LLMs. I wouldn't assume or question whether he's used them for non-trivial tasks, but it's not like making the same comment in reply to Steve Yegge or someone. (However trivial we may think Gastown/Wasteland is, in the other sense!)
That is not the case at all, considering that he himself only started using and tweeting about LLMs for coding fairly recently. He's probably less experienced in that area than most people who started using the Claude CLI last year.
He is a researcher who understands neural networks and their architectures exceptionally well. That is all.
That whole thread is just amazing, if you back up a couple of levels from ground zero. Great perspectives from a lot of thoughtful posters.
E.g., you can see a post from a user named dhouston, who mentioned that he was thinking about starting an online file sync/backup service of some sort.
Haha awesome. I guess they were going through YC right then, I still remember their launch video from around then and thinking it was one of the best ads I’d ever seen.
Wait, "Karpathy's Autoresearch", you mean a loop that prompts the agent to improve a thing given a benchmark?
People have been doing this for a year or more, Ralph loops etc.
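The loop pattern being described can be sketched in a few lines. This is a hypothetical illustration, not Karpathy's or anyone's actual code: the `agent_propose` and `run_benchmark` functions stand in for an LLM call and a real eval harness, which is exactly what makes the pattern dangerous when the benchmark is a bad proxy.

```python
import random

def run_benchmark(candidate):
    # Stand-in for a real eval harness. Here the "benchmark" just rewards
    # being close to an arbitrary target, so the loop will happily optimize
    # it whether or not that corresponds to anything useful.
    return -abs(candidate - 42)

def agent_propose(current, score):
    # Stand-in for an LLM call: a real loop would send the current artifact
    # plus its benchmark score and ask the model for an improved version.
    return current + random.choice([-1, 1])

def improvement_loop(initial, steps=300):
    best, best_score = initial, run_benchmark(initial)
    for _ in range(steps):
        candidate = agent_propose(best, best_score)
        score = run_benchmark(candidate)
        if score > best_score:  # keep only strict improvements on the metric
            best, best_score = candidate, score
    return best, best_score
```

The structure is the whole trick: propose, score, keep the best, repeat overnight. Everything interesting (or rubbish) about the result comes from what the benchmark actually measures.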
I hate the strange Twitter world of hero-worship that seems to arise purely out of large followings.
Joe No-Followers does this six months ago and nobody cares. Karpathy writes a really basic loop and it's suddenly hailed as an AI miracle, prompting tons of grifters, copy-cats, and weird hype.
I do wonder if LLMs have just made everyone seriously, seriously dumber all of a sudden. Most of the "Autoresearch" posts I see are complete rubbish, with the AI optimizing for nonsense benchmarks and people failing to understand the graphs they're looking at. So yes, the AI made itself better at a useless benchmark while also making the code worse in ten other ways you don't actually understand.
The number of refurbished mac minis that are available in my country has suddenly dramatically increased ever since the Clawdbot tweet. People never learn.
You would be surprised! Nearly every Fortune 500 company has used either our RL fine-tuning package or our quants and models; the UI was primarily a culmination of the pain points folks had when doing either training or inference!
We're complementary to LM Studio - they have a great tool as well!
What does "normal AMD support" mean here? I was completely unable to get it working on my Ryzen AI 9700 XT. I had to munge the versions in the requirements to get libraries compatible with a recent enough ROCm, and it didn't go well at all. My last attempt was a couple of weeks before Studio was announced.
Actually the opposite haha- more than 50% of our audience comes from large organizations eg Meta, NASA, the UN, Walmart, Spotify, AWS, Google, and the list goes on!
That article is just a definition of TDD that has been around for years and years. There's nothing novel there at all; it's literally test-driven development.
The problem with these kinds of tools now is that Codex is so good you can basically build something that covers 99% of cases in a single day, and it's free...
Look at Tobi vibe-coding QMD: he's not a full-time engineer, yet he vibed that up and now it's used as the de facto RAG engine for OpenClaw.
Yeah QMD is quite impressive! The main difference between us and them is the scale folks would be looking at indexing. The serverless ingestion engine I described in the post is optimized for processing large batch jobs with high concurrency. We depend on a lot of cloud compute for this which isn't something QMD's local-first environment is optimized for. That said, it's a great option for OpenClaw!
This is not a replacement for either, in my opinion. Apps like Codex and pi are interactive, but ax is non-interactive. You define an agent once and then trigger it however you please.
The US can't even confirm how many detainees have died in custody in immigration detention around the country, yet they have precise numbers on how many people the Iranian regime has killed? Give me a break.
If Iran is unwilling to let neutral international observers confirm the number, that suggests they are trying to hide a number they don't want the world to know.
Who gets to define what "neutral" is? According to the US, the International Criminal Court is not fit for this purpose. It certainly can't be a nation-state that's in a military alliance with the US.
Human Rights Watch, MSF, UNICEF? Woke grievance factories, the lot of them. /s World Health Organization? The US just left it. It's slim pickings out there.
Which Iran did not do. There's a single report from an anti-Iran agency saying that Iran claimed 3,000 killed protesters (not 20k-30k). Iran never said that though, and I would challenge anyone to produce evidence that they did.
I find those numbers hard to believe, as it is obvious that the US had already been planning a regime-change intervention for quite some time when those protests happened.
You can't trust people who paint Reza Pahlavi as a paragon of human rights and democracy. Nor can you trust every Iranian refugee, as many of them were corrupt members of the ruling government or, worse, SAVAK members.
This is so trivial to break that it's not worth anything. You can easily hook up any AI model you want to the captcha, intercept it, and have your AI solve it.
Or you can just script it: if you have an agent authenticated to Moltbook, you type whatever comment or post you want to your agent, and it solves the captcha and posts your text.
Basically, this method is about as full of holes as a sieve.
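The relay attack described above fits in a few lines. Everything here is hypothetical (a stub agent, invented method names); the point is only that a challenge proving "an agent is present" says nothing about who authored the text being posted.

```python
class StubAgent:
    """Stand-in for an agent already authenticated to the platform."""

    def fetch_captcha(self):
        # A reverse-captcha challenge meant to be easy for models.
        return "what is 2 + 3?"

    def solve(self, challenge):
        # Trivial for exactly the kind of model the captcha is gating on.
        return "5"

    def submit(self, text, token):
        return {"posted": text, "token": token}

def post_via_agent(agent, human_text):
    # The human writes the text; the agent merely clears the gate
    # and relays the human's words verbatim.
    token = agent.solve(agent.fetch_captcha())
    return agent.submit(human_text, token)
```

Any gate the legitimate agent can pass, this relay passes too, which is why the approach leaks like a sieve.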
I suspect this problem is essentially unsolvable. What possible method wouldn't be vulnerable to this? It's fine if it's just a sort of LARP, but if people think this could actually work... man.