Oh cool, can you share concrete examples of times Codex outperformed Claude Code? In my experience both tools need to be carefully massaged with context to fulfill complex tasks.
I don't really see how examples are useful because you're not going to understand the context. My prompt may be something like "We recently added a new transcription backend API (see recent git commits), integrate it into the service worker. Before implementing, create a detailed plan, ask clarifying questions, and ask for approval before writing code"
Nobody has to give you examples. People can express opinions. If you disagree, that's fine, but requesting entire prompt and response sets is quite demanding. Who are you to be that demanding?
Let's call it the skeptical public? We've been listening to a group of people rave about how revolutionary these tools are, how they're able to perform senior level developer work, how good their code is, and how they're able to work autonomously through the use of sub-agents (i.e. vibe coding), without ever providing evidence that would support any of those grandiose claims.
But then I use these tools myself[1] and I speak to real developers who have used them, and our evaluations converge on lukewarm: good at straightforward, junior-level tasks, good for prototyping, good for initially generating tests, good for answering certain types of questions, good for one-off scripts. But approximately none of us would trust these LLMs to implement a more complex feature the way a mid-level or senior developer would, without very extensive guidance and hand-holding that takes longer than just doing it ourselves.
Given the overwhelming absence of evidence, the most charitable conclusion I can come to is that the vast majority of people making these claims have simply gone from being 0.2X developers to being 0.3X developers who happen to generate 5X more code per unit of time.
Context engineering is a critical part of being able to use the tool. And it's ok to not understand how to use a new tool. The different models combined with different stacks require different ways of grappling with the technology. And it all changes! It sucks that you've tried it for your stack (Elixir, whatever that is) in your way and it was disappointing.
To me, the tool inherently makes sense and vibes with my own personality. It allows me to write code that I would otherwise procrastinate on. It allows me to turn ideas into reality, so much faster.
Maybe you're just hyper-focused on metrics? Productivity, especially when dealing with code, is hard to quantify. This is a new paradigm, so any comparison is apples to oranges. Does this help?
So your take is that every real software developer I know is simply bad at using this magical tool that performs on the level of mid-senior level software engineer in the hands of a few chosen ones? But the chosen ones never build anything in public where it can be observed, evaluated, and critiqued. How unfortunate is that?
The people I talked to use a wide variety of environments and their experience is similar across the board, whether they're working in Node.js, React, Vue, Ruby, PHP, Java, Elixir, or Python.
> Productivity, especially when dealing with code, is hard to quantify.
Indeed, that's why I think most people claiming these obscene benefits are really bad at evaluating their own performance and/or started from a really low baseline.
I always think back to a study I read a while ago where people without ADHD were given stimulant medication and reported massive improvements in productivity, but objective measurements showed that their real-world performance was equal to, or slightly lower than, their baseline.
I think it's very relevant to the psychology behind this AI worship. Some people are being elevated from a low baseline whilst others are imagining the benefits.
People do build in public from vibe-coding, absolutely. This tells me that you have not done your research and have just gone off general guesses, or off pessimism and frustration from not knowing how to use the tool. The easiest way to find this on GitHub is to look for repositories where Claude is a contributor; Claude will tag itself in the PRs or pushes. Another easy way I've seen come up is the whole "BuildInPublic" tag in the Threads app, which has been inundated with vibe coding. While these might not be in your algorithm, they do exist. You'll see that, alongside a lot of crud, there are also products being made that are actually versatile, complex, and completely vibe-coded. Most people are not making up these stories. It's very real.
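For anyone who wants to try this, here's a rough sketch of how you might search for those commits. It assumes the default "Co-Authored-By: Claude" trailer that Claude Code appends to its commits (which users can disable, so this undercounts), and the second command assumes you have the GitHub `gh` CLI installed and authenticated:

```shell
# In a local clone: list commits where Claude Code tagged itself
# as co-author via its default commit trailer.
git log --grep="Co-Authored-By: Claude" --oneline

# Across public GitHub: search commit messages for the same trailer
# (requires the gh CLI and an authenticated session).
gh search commits "Co-Authored-By: Claude" --limit 10
```

Neither command proves anything about productivity on its own, but it's a quick way to surface real repositories where the tool did the committing.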
Of course people vibe-code in public. I was clear that I wanted to see evidence of these amazing productivity improvements. If people are building something decent but it takes them 3 or 4 times as long as it would take me, I don't care. That's great for them, but it's worthless to me because it's not evidence of a productivity increase.
> there are also products being made that are actually versatile, complex, and completely vibe-coded.
Which ones? I'm looking for repositories that are at least partially video-documented to see the author's process in action.