mpyne · 2026-05-30T17:47:20 1780163240

> I just don't believe non-deterministic tools can actually be benchmarked. It's all hoopla to me.

We benchmark non-deterministic things all the time and it's frankly not even that unusual or hard. You yourself indicate that one model outperforms another one in your experience on various facets, and that is itself a benchmark.

The more relevant question is probably how well does a given benchmark translate to improvement on a specific desired outcome or task. The military uses the ASVAB testing battery to benchmark potential new recruits for suitability in various career specialties, but the actual outcome the benchmark is meant to correlate with is later success in the training pipeline.

So every so often the various military branches have to go back and compare ASVAB results against training results to make sure that they still have a predictive relationship.

And this is benchmarking real flesh-and-blood human beings where you get on the order of magnitude of a million data points or so per year. You can benchmark AIs much more efficiently than that, as non-deterministic as they are, and as long as the benchmark itself is reasonably predictive of outcome it's going to be useful information.
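As a rough sketch of what "still have a predictive relationship" can mean in practice, one simple check is the correlation between benchmark scores and the later outcome. The function below is a plain Pearson correlation; the score and outcome lists are whatever you collect yourself, nothing here comes from real ASVAB data, and a real validity study would do a lot more than this.

```python
from math import sqrt


def pearson_r(scores, outcomes):
    """Pearson correlation between benchmark scores and later outcomes.

    Both arguments are equal-length sequences of numbers (hypothetical
    inputs for illustration). A value near 0 suggests the benchmark has
    stopped predicting the outcome it is supposed to track.
    """
    if len(scores) != len(outcomes) or len(scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(scores)
    mean_s = sum(scores) / n
    mean_o = sum(outcomes) / n
    # Sum of co-deviations and the two sums of squared deviations
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(scores, outcomes))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_o = sum((o - mean_o) ** 2 for o in outcomes)
    if var_s == 0 or var_o == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / sqrt(var_s * var_o)
```

Re-running something like this on each new cohort is the cheap version of the periodic re-validation described above.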

kaydub · 2026-05-30T18:18:33 1780165113

My anecdotal experience isn't a benchmark. Just because I feel like something is better or different doesn't mean it actually is.

mpyne · 2026-05-31T04:11:01 1780200661

> Just because I feel like something is better or different doesn't mean it actually is.

Of course, but it is a data point, and multiple such data points can be aggregated. This is true even if all you can do is compare two things.

The shape of that data will reveal something more about the thing you want to measure than the null hypothesis you'd otherwise have.
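To make the aggregation point concrete, here is a minimal sketch of a sign test over pairwise "A felt better" / "B felt better" judgments. It assumes the judgments are independent and that ties are simply dropped, which real anecdotes may not satisfy. The function name and tallies are made up for illustration.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test against the null of no preference (p = 0.5).

    wins_a / wins_b count how often each option was judged better.
    Ties are assumed to have been dropped before counting.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no data, nothing to distinguish from the null
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it for the two-sided test, capped at 1
    return min(1.0, 2 * tail)
```

A 5-to-5 split gives a p-value of 1.0, while a 10-to-0 split gives roughly 0.002, so even crude "which one felt better" tallies start to separate from the null once enough of them pile up on one side.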

mpyne · 2026-05-29T00:45:33 1780015533

In fairness, with the waterfall methodology that pervaded back then, the "first" system you shipped was actually the second. "Build one to throw away; you will, anyhow".

db48x · 2026-05-29T04:21:24 1780028484

Actual waterfall development was far more iterative than most people realize. If you want a primary source, I recommend Sunburst and Luminary by Don Eyles. It recounts the development of the software for the Apollo lunar lander.

mpyne · 2026-05-31T04:14:59 1780200899

Then it wasn't "actual waterfall development". The paper that defined waterfall literally tells you that you will build the system twice. Ideally only twice. You can refer to Dr. Royce's paper as the primary source on that.

It is heartening to know that iterative development has been commonplace since long before the agile manifesto was written though, to make it clear that it has long been used and long been successful.

mpyne · 2026-05-23T16:56:57 1779555417

> In fact I thought the government had long since gotten pretty serious about using smartcards and HSMs for everything?

They do use it for a lot, but there are a lot of things that need to authenticate to each other in a modern ecosystem, especially if you're trying to replace security based on network boundaries as trust boundaries with zero trust (as the government is).

I worked with more than a few IL4 systems where the PKI/smartcard stuff was simply shoved into an F5 that did TLS termination and then everything on the internal VPC just used HTTP headers without even a crypto signature to convey which user had actually logged in.
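For contrast, a minimal sketch of what "a crypto signature" on that identity header could look like: the TLS-terminating proxy signs the user name with an HMAC and the backend verifies it before trusting it. This assumes a shared secret between proxy and backend. The function names are hypothetical, this is not how any particular F5 setup works, and it does nothing about replay.

```python
import hashlib
import hmac


def sign_user_header(user: str, key: bytes) -> str:
    """Proxy side: return a hex HMAC-SHA256 tag for the user identity."""
    return hmac.new(key, user.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_user_header(user: str, signature: str, key: bytes) -> bool:
    """Backend side: accept the identity only if the tag matches."""
    expected = sign_user_header(user, key)
    # Constant-time comparison so the check does not leak how much matched
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

With something like this, a request that reaches the backend with a forged user header but no valid tag gets rejected instead of being trusted because it came from inside the VPC.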

As with anything else, the more you make it easy to do the right thing, the more often you tend to see the right thing being done. So agencies that make it easy to request server PKI certs see increased uptake; other agencies just have server-to-server auth done by PSKs / API keys instead.

So the concern isn't usually cost but compliance. If it's nearly impossible to get those little developer experience affordances ATO'd themselves, agencies will instead just focus on getting the mission system itself ATO'd come hell or high water and the devs just get told to piece it together however...

mpyne · 2026-05-22T21:02:49 1779483769

Yes, the harness they used actually existed and was in use beforehand, it wasn't developed for testing with Mythos.

mpyne · 2026-05-20T00:04:26 1779235466

> The stdin-vs-stdout split is where I see the most actual "is this a TTY" mistakes though. Tools that emit JSON-on-stdout-when-piped and TUI-when-not work fine until something stuffs them into a PTY with piped stdin — then they're in TUI mode but can't actually read the user input format they expect.

Stuff like this is why a build script I used to maintain would redirect stdin from /dev/null when running commands that were intended to be non-interactive. You only need one script to hang forever waiting for a user to type in a password to decide that you'll force the issue going forward.
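The build script in question isn't shown here, but the same idea in Python looks roughly like the sketch below. The wrapper name is made up; `subprocess.DEVNULL` and the `timeout` argument are standard library features.

```python
import subprocess
from typing import List, Optional


def run_noninteractive(cmd: List[str], timeout: Optional[float] = None):
    """Run cmd with stdin attached to the null device.

    Any attempt by the child to read from stdin gets an immediate EOF
    instead of blocking forever on a prompt nobody will answer.
    """
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # equivalent of `< /dev/null` in a shell
        capture_output=True,
        text=True,
        timeout=timeout,  # optional backstop for hangs unrelated to stdin
    )
```

A child that does try to read gets an empty read and has to decide what to do with it, which fails fast and visibly rather than hanging.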

jonnyasmar · 2026-05-20T00:08:59 1779235739

Same problem flipped: I once watched a CI step hang for 47 minutes because some sub-command popped a `read -p "Continue?"` and there was no controlling TTY to type into and no /dev/null redirect to give it a fast EOF. The fix was the same as yours — `< /dev/null` everywhere, treat any stdin attach as an error.

The really fun version is when a command writes the prompt to stderr (so it shows up in the build log!) and then reads from a stdin you didn't realize was still open. Took embarrassingly long to track down.
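On the stdin-vs-stdout split quoted upthread, a minimal sketch of checking the two streams separately rather than treating "is this a TTY" as one question. The mode names are invented for illustration, and a real tool would likely want to look at stderr too.

```python
import sys


def choose_mode(stdin=None, stdout=None) -> str:
    """Pick an output mode from the two streams independently.

    "tui"   : both ends are terminals, so prompting is safe
    "plain" : a human is watching stdout but stdin cannot answer prompts
    "json"  : stdout is piped, so emit machine-readable output
    """
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    in_tty = stdin.isatty()
    out_tty = stdout.isatty()
    if in_tty and out_tty:
        return "tui"
    if out_tty:
        return "plain"
    return "json"
```

The PTY-with-piped-stdin case lands in "plain" here instead of a TUI that can never receive the input it is waiting for.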

mpyne · 2026-05-18T00:06:14 1779062774

Recruiting for those considering careers, and marketing more broadly for those who pay taxes.

mpyne · 2026-05-17T22:29:58 1779056998

They'd actually make economic sense where I live; the only thing that's held me back from pulling the trigger is that I want to time it with when I need to have the roof inspected/replaced.

I'm aware of the arguments about how it can be that much cheaper when deployed at mass centralized scale rather than decentralized across a bunch of rooftops. However, the way the electric markets are priced is based primarily on the cost to produce the marginal supply, which is usually gas.

So while the power company might flood a bunch of solar panels trying to capture the profit between cost to generate solar vs. cost to generate using gas, those profits haven't been lowering electric costs at residential rates. If anything those costs are still climbing.

It's actually not hard to get rooftop solar to pencil out in that situation, especially if you assume even moderate growth in future electricity rates or inflation. In my own tracker it would even be superior to paying down additional principal on my home mortgage!

Admittedly it would be less of a slam dunk if the net metering was less generous around here as you'd basically be required to add battery to the mix if you weren't already. But even that just prolongs the time to payoff, it still ends up having good ROI economically speaking.

mpyne · 2026-05-17T21:14:28 1779052468

My time in submarines at sea just coincided with the last few years where smoking on submarines was still authorized.

It was awful, just awful. Especially in a space as cramped as a submarine and with a common ventilation system, you can't just put the smokers in a convenient spot all to themselves, they're always going to be near something the rest of the crew needs to access.

Gravityloss · 2026-05-18T16:07:22 1779120442

I bet some individuals would bring back smoking in submarines in a heartbeat, if offered suitable sponsoring...

mpyne · 2026-05-17T15:26:08 1779031568

> The process didn’t work before because the person writing the requirements either put out vague requirements or bad requirements because they didn’t understand the business intent (or were careless).

You make it sound like writing good requirements is easy.

If it were easy we wouldn't need all these concepts around PMF, product pivots and the like. And even before that was Peter Naur's paper "Programming as Theory Building" [1].

If you truly understand the problem you're solving with software then requirements can be easy. But usually we don't, not right away, and so we have to build up our understanding of the problem first in order to solve it.

Even then, the problem we solve may not have been the problem paying users will have, so you can have "good requirements" and still have a bad business, or even the opposite where you somehow build a working business despite bad requirements, because you hit upon a customer's need quite by mistake.

Nothing about any of this precludes LLMs being helpful, though nothing guarantees LLMs will be helpful either.

[1]: https://cekrem.github.io/posts/programming-as-theory-buildin...

rubyfan · 2026-05-17T22:33:39 1779057219

> You make it sound like writing good requirements is easy.

I am certain I didn’t say that. To be a good product owner one needs skill, care and understanding of the business intent. If you know the business intent but lack the skill to express it as a useful requirement then it’s insufficient; if you have the skill but lack understanding or ability to understand the business intent then it’s insufficient; if you have the skill and understand the business intent but you are careless in your work then it’ll be insufficient too. If the problem space is emergent then having all three might not be good enough either.

It’s certainly true that good engineering teams can deeply understand the problem space enough to get to a business outcome without requirement documents.

I just wouldn’t bet that LLMs are going to make any of these realities any better, they might exacerbate those issues.

mpyne · 2026-05-18T02:15:12 1779070512

> I just wouldn’t bet that LLMs are going to make any of these realities any better, they might exacerbate those issues.

Yes, that's certainly a fair assessment, especially the more it convinces software developers they can talk to the LLM rather than talking to users.

mpyne · 2026-05-17T14:58:35 1779029915

If there are datacenters sitting idle right now then you could probably make a lot of money selling that capacity to Anthropic at this point...