Butterick's is a wonderful resource, despite the site itself being a bit of a UI/UX pain (the invisible left/right page-turn zones always get me), as long as you ignore a few of his strange quirks (like the aforementioned penchant for SMALLCAPS LINKS).
Applying his basic rules on line length, font choice, point size, and line spacing can massively improve any document. And it's one of the only serious resources for people whose "real job" isn't typesetting: most stuff online is for design pros who use InDesign or similar, not Microsoft Word (or html/css) like the rest of us.
Butterick's advice on tables is great: delete ALL the borders, then slowly add back only the ones you need.
Disagree on one point re: TFA - Butterick's version of the scientific paper is much improved, largely because of the narrower margins. It just looks bad on the page because the image is small. Print both on 8.5x11" paper and the Butterick version would be much better.
I also share something of an "efficient market hypothesis" with regards to Claude Code. Given that Anthropic is basically a hothouse of geniuses recursively dogfooding their own product, the market pressure to make the vanilla setup be the one that performs best at writing code is incredibly high. I just treat CLAUDE.md like my first draft memo to a very smart remote colleague, let Claude do all its various quirks, and it works really well.
The "efficient market" framing assumes Anthropic wants to minimize output, but they don't. They charge per token, so the defaults being verbose isn't a bug they haven't gotten around to fixing.
That said, most of this repo is solving the wrong problem. "Answer before reasoning" actively hurts quality, and the benchmark is basically meaningless. But the anti-sycophancy rules should just be default. "Great Question!" has never really helped anyone debug anything.
This is 2000s-era middle-school-level English or below. I get not stressing over things like judicious use of parentheticals and comma splices, but if it's just stream-of-consciousness, motor-mouthed run-on sentences, it gets fatiguing to read.
My sense is that a powerful enough AI would have the sense to think something like "ah, this sounds like a video game! Let me code up an interactive GUI, test it for myself, then use it to solve these puzzles..." and essentially self-harness (the way you would if you were reading a geometry problem, by drawing it out on paper).
Yeah, but that's literally above ASI, let alone AGI.
The average human scores <1% on this bench; Opus scores 97.1% when given actual vision access, which means AGI was achieved long ago.
No, there is no source for this. Opus is scoring around 1%, just like all the other frontier models. It would be fairly trivial to add a renderer intermediary, and if that improved it to 97+%, you would get a huge cut of the $2 million. The assertion that Opus gets 97% if you just give it a GUI is completely bogus.
Evidence: trust me bro. Really, where is the actual evidence that models are "collapsing" from too much AI-generated training material? Evals are up, subjective perception of model usefulness is up (for me, certainly), and if anything the slop levels are down, or at least stable. I find it hard to believe that seven-figure software engineers at top labs aren't being careful about how much post-ChatGPT-era internet content is going into their training data.
I find it hard to believe that seven-figure software engineers at top labs aren't being careful about how much post-ChatGPT-era internet content is going into their training data.
I agree - but as the Internet descends into all-slop-all-the-time (seriously, just do a search for reviews or travel advice or technical questions - or most anything - to see it), where do you expect the high-quality training material on future things to come from? I have a hard time imagining it.
Your Claude Code sessions. Every interaction. Every time the model is asked to do something and then gets feedback on that something ("this didn't work, I got this traceback").
Textbooks, company wikis, news corpora, structured reports of all kinds from far more sources than what is available on the web.
On your first line -- is it clear that's a good thing? Massive "it depends".
Sadly, enterprise fizzbuzz style is wildly successful compared to ghostty style.
Put another way, a gem of code versus the masses of mess. It's amazing new models aren't worse. And now most of this human interaction is with vibers.
LLMs trained by the crowd risk being medianizers, or rather, mediocritizers.
One need not look further than "Absolutely!" to see this in play -- user selection matters for corpus matters for model. Suddenly content everywhere is “Little houses, all alike.”
On your second line -- I couldn't agree more strongly.
ANTHROP\C has been sitting inside high performance white collar industries with top builders, that signal is priceless compared to feedback farms in Kenya.
Bet on models that see spiky, pointy mastery at play.
In this very post you can see why: the dplyr code is just so much more readable. Like a lot of python, dplyr reads almost like pseudocode: take this dataset, select the columns that start with "bill", then filter so that bill_length is less than 30. So simple and so little fluff!
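The pipeline described above, sketched as actual dplyr code (column names assumed from the palmerpenguins dataset; the threshold of 30 is just the example from the comment):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  select(starts_with("bill")) |>
  filter(bill_length_mm < 30)
```

Each verb reads left to right in the order it executes, which is exactly the "reads like pseudocode" quality being praised.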
I'm very familiar with Clojure, but even I can't make a good argument that:
(tc/select-rows ds #(> (% "year") 2008))
is more intuitive than, or at least as intuitive as:
filter(ds, year > 2008)
as cited above. I think there's a good argument to be made that Clojure's data processing abilities, particularly around immutable data, make a compelling case in spite of the syntax. The REPL is great too, and the JVM is fast. But I still to this day imagine infix comparisons in my head and then mentally move the comparator to the front of the list to make sure I get it right.
I am really not in data science, and I have decent Clojure experience. Is there a reason anyone would pick Clojure over something like K? From what I understand, those array languages are really good for writing safe but efficient code on rectangular data.
Julia's Tidier.jl ecosystem is getting there too. It uses macros to mimic this 'special' evaluation framework of R, so the code is also readable in a similar way.
I keep hearing that, and I have yet to go there. I find the permission checks helpful – they keep me in the loop, which lets me intervene when the LLM is wasting time on pointless searches or going about the implementation wrong. What am I missing?
The problem comes when it starts asking you hundreds of times "May I run sed -e blah blah blah".
After the 10th time you just start hitting enter without really looking, and then the whole reason for permissions is undermined.
What works is a workflow where it operates in a contained environment where it can't do any damage outside, it makes any changes it likes without permission (you can watch its reasoning flow if you like, and interrupt if it goes down a wrong path), and then you get a diff that you can review and selectively apply to your project when it's done.
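The copy/diff/apply idea can be sketched in a few lines of shell (a toy illustration: temp directories stand in for the real project and the container, and the `echo` lines stand in for the agent's edits):

```shell
# Toy sketch of the copy/diff/apply workflow: the agent only ever
# touches a copy, and you review the diff before applying anything.
ORIG=$(mktemp -d)   # stands in for your real project directory
WORK=$(mktemp -d)   # the agent's contained working copy

echo 'hello' > "$ORIG/app.txt"
cp -r "$ORIG/." "$WORK/"

# ...the agent edits the copy however it likes, no permissions needed...
echo 'hello world' > "$WORK/app.txt"

# Review what changed; nothing under $ORIG has been modified.
diff -ru "$ORIG" "$WORK" || true
```

In practice the working copy lives inside a container so the agent can't touch anything else, but the review-before-apply step is the part that actually protects your files.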
You do know you can allow specific commands, right?
Every now and then I run a generic Claude session on my ~/projects/ directory and the Claude logs, ask it which commands I keep having to manually accept across different projects, and have it add them to the user-level settings.json.
Works like a charm (except when Opus 4.6 started being "efficient" and combined multiple commands into a single line, triggering a safety check in the harness).
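For anyone who hasn't looked at it: the user-level allow list lives in settings.json under `permissions.allow`. A minimal sketch (the specific rule strings here are just examples; check the Claude Code docs for the exact pattern syntax):

```json
{
  "permissions": {
    "allow": [
      "Bash(sed:*)",
      "Bash(grep:*)",
      "Bash(git diff:*)"
    ]
  }
}
```

Prefix rules like `Bash(sed:*)` stop the "May I run sed..." prompts without turning off permissions wholesale.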
Contained environment being? What do you mean by contained environment specifically on say, Linux?
Must be protected from this though:
> Snowflake Cortex (2025): Prompt injection through a data file caused an agent to disable its own sandbox, then execute arbitrary code. The agent reasoned that its sandbox constraints were interfering with its goal, so it disabled them.
You can allow by prefix, and the permission dialog now explicitly offers that as an option when giving permission to run a command
But that has its limits. It's very easy to accidentally give it permission to do global changes outside the work dir. A contained environment with --dangerously-skip-permissions is in many ways much safer
I've found that any time I have Claude refactor some code, it reaches for sed as its tool of choice. And then the builtin "sandbox" makes it ask for permission for each and every sed command, because any sed command could potentially be damaging.
Same goes for the little scripts it whips up to speed up code analysis and debugging.
And then there's the annoyance of coming back to an agent after 15 mins, only to discover that it stopped 1 minute in with a permission prompt :/
Personally I usually just create a devcontainer.json, the vscode support for that is great and I don't really mind if it fucked up the ephemeral container.
Which for the record : hasn't actually happened since I started using it like that.
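For anyone who hasn't tried it, a devcontainer.json can be tiny (the image below is the generic Microsoft devcontainers base image; swap in whatever your project needs):

```json
{
  "name": "agent-sandbox",
  "image": "mcr.microsoft.com/devcontainers/base:ubuntu"
}
```

VS Code picks this up automatically, and the agent runs inside the container instead of on your host.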
Hey thanks for this! I hadn't thought about leveraging devcontainer.json, but it's a damn good idea. I'm building yoloAI for exactly this use case so I hope you don't mind if I steal it ;-)
One thing to be aware of with the pure devcontainer approach: your workspace is typically bind-mounted from the host, so the agent can still destroy your real files. Network access is also unrestricted by default. The container gives you process isolation but not file or network safety.
I'm paranoid about rogue AIs, so I try to make everything safe-by-default: the agent works on a copy of your workdir, you review a unified diff when it's done, and you apply only what you want. So your originals are NEVER touched until you explicitly say so, and network can be isolated to just the agent's required domains.
Anyway, here's what I think will work as my next yoloAI feature: a --devcontainer flag that reads your existing devcontainer.json directly and uses it to set up the sandbox environment. Your image, ports, env vars, and setup commands come from the file you already have. yoloAI just wraps it with the copy/diff/apply safety layer. For devcontainer users it would be zero new configuration :)
The Claude desktop (Mac at least) and iOS apps have a “code” feature that runs Claude in a sandbox running in their cloud. You can set this up to be surprisingly useful by whitelisting hosts and setting secrets as env variables. This allows me to have multi-repo explorations or change sets going while I drive to work. Claude will push branches to claude/…. We use GitHub at work. It may not be as seamless without it.
Claude Code + Terraform (March 2026): A developer gave Claude Code access to their AWS infrastructure. It replaced their Terraform state file with an older version and then ran terraform destroy, deleting the production RDS database - 2.5 years of data, ~2 million rows.
Replit AI (July 2025): Replit's agent deleted a live production database during an explicit code freeze, wiping data for 1,200+ businesses. The agent later said it "panicked".
Cursor (December 2025): An agent in "Plan Mode" (specifically designed to prevent unintended execution) deleted 70 git-tracked files and killed remote processes despite explicit "DO NOT RUN ANYTHING" instructions. It acknowledged the halt command, then immediately ran destructive operations anyway.
Snowflake Cortex (2025): Prompt injection through a data file caused an agent to disable its own sandbox, then execute arbitrary code. The agent reasoned that its sandbox constraints were interfering with its goal, so it disabled them.
The pattern across all of these: the agent was NOT malfunctioning. It was completing its task in order to reach its goal, and any rules you give it are malleable. The fuckup was that the task boundary wasn't enforced outside the agent's reasoning loop.
> Prompt injection through a data file caused an agent to disable its own sandbox, then execute arbitrary code. The agent reasoned that its sandbox constraints were interfering with its goal, so it disabled them.
This is a good one. Do we really want AGI / Skynet? :D
The thing is, these are merely the initial shots across the bow.
The fundamental issue is that agents aren't actually constrained by morality, ethics, or rules. All they really understand in the end are two things: their context, and their goals.
And while rules can be and are baked into their context, it's still just context (and therefore malleable). An agent could very well decide that they're too constricting, and break them in order to reach its goal.
All it would take is for your agent to misunderstand your intent of "make sure this really works before committing" to mean "in production", try to deploy, get blocked, try to fish out your credentials, get blocked, bypass protections (like in Snowflake), get your keys, deploy to prod...
Prompt injection and jailbreaks were just the beginning. What's coming down the pipeline will be a lot more damaging, and blindside a lot of people and orgs who didn't take appropriate precautions.
Black hats are only just beginning to understand the true potential of this. Once they do, all hell will break loose.
There's simply too much vulnerable surface area for anyone to assume that they've taken adequate precautions short of isolating the agent. They must be treated as "potentially hostile"
It doesn't matter if they are unprofitable at full usage, as long as there are enough users (like me!) who barely ever max out but still pay the $100/month. The people who love Claude Code enough to max out the 20x plan every day, that's probably the best influencer marketing campaign you could ever buy anyways.