Microservices was originally envisioned to literally create the smallest possible service you could, with the canonical Netflix use case being only one or two endpoints per microservice.
Which is great I guess for FAANGs. But makes no sense for just about anyone else.
> Microservices was originally envisioned to literally create the smallest possible service you could
"Micro web services" was coined at one time, back when Netflix was still in the DVD business, to refer to what you seem to be speaking about — multiple HTTP servers with narrow functional scope speaking REST (or similar) that coordinate to build something bigger in a Unix-style fashion.
"Microservices" emerged when a bunch of people at a tech conference discovered that they were all working in similar ways. Due to Conway's Law that does also mean converging on similar technical approaches, sure, but because of Conway's Law we know that team dynamics comes first. "Microservices" wasn't envisioned — it was a label given to an observation.
Microservices came about because Netflix and soon some other FAANGs found they were so big that it made sense to make single-function "micro"-services. They literally chose to massively duplicate functionality across services because their scale was so big it made sense for them.
This is great for FAANG-scale companies.
It makes little sense for most other companies, and in fact incurs all of the overhead you would expect - overly complex architecture, an explosion of failure points, a direct elongation of latency as microservices chain calls to each other, chicken-and-egg circular references among services, and all of the mental and physical work to maintain those dozens (or hundreds, or thousands!) of services.
The funny thing to me is people point to monoliths and say "see, problem waiting to happen, and it will be so hard to undo it later!". But I see microservices, and usually the reality is "We have problems right now due to this architecture, and making it sane basically means throwing most or all of it away".
In reality, unraveling monoliths is not as hard as many people have made out, while reasoning about microservices is much harder than advertised.
Tooling in particular makes this a very hard nut to crack. Particularly in the statically typed world, there are great tools to verify large code bases.
The tooling for verifying entire architectures - like a huge set of microservices - is way behind. Of course this lack of tooling impacts everyone, but it makes microservices even harder to bear in the real world.
Forget about convenient refactoring, and a thousand other things....
Nah. You've made up a fun story, but "microservices" was already in the lexicon by 2011, while Netflix didn't start talking about their system until 2012.
> This is great for FAANG-scale companies.
Certainly. At FAANG scale, teams need separation. There are so many people that there isn't enough time in the day for the teams to stay in close communication; you'd spend 24 hours a day just in meetings coordinating everyone. So microservices say: instead of meetings, cut off direct communication, publish a public API with documentation, and let others figure it out, just like services sold by other companies.
If you are a small company you don't have that problem. Just talk to the people you work with. It is much more efficient at normal scale.
I think the big problem no one talks about is that "microservices" was an incredibly poor name.
Your design goal should not be "create the smallest service you can to satisfy the 'micro' label". Your design goal should be to create right-sized services aligned to your domain and organization.
The deployment side is of course a red herring. People can and do deploy monoliths with multiple deployments and different endpoints. And I've seen numerous places do "microservices" which have extensive shared libraries where the bulk of the code actually lives. Technically not a monolith - except it really is, just packaged differently.
> Your design goal should not be "create the smallest service you can to satisfy the 'micro' label".
A place I worked at years ago did what I effectively called "nano-services".
It was as if each API endpoint needed its own service. User registration, logging in, password reset, and user preference management were each their own microservice.
When I first saw the repo layout, I thought maybe they were just using a bunch of Lambdas that would sit behind an AWS API Gateway, but I quickly learned the horror as I investigated. To make it worse, they weren't using Kubernetes or any sort of containers for that matter. Each nanoservice was running on its own EC2 instance.
I swear the entire thing was designed by someone with AWS stock or something.
I know one place that did all of their transactional payment flow through lambdas. There were about 20 lambdas in the critical auth path, and they regularly hit the global per-account AWS limits.
Another place did all their image processing via lambdas, about fifty of them. They literally used lambdas and REST calls where anyone sane would have done it in one process with library calls. It cost them tens of thousands of dollars a month to do basic image processing that should have cost about $100 or so.
Another key is that you should always be able to reasonably hack on just one of the "services" at once— everything else should be able to be excluded completely or just run a minimal mock, for example an auth mock that just returns a dummy token.
If you've got "microservices" but every dev still has to run a dozen kubernetes pods to be able to develop on any part of it, then I'm pretty sure you ended up with the worst of both worlds.
I agree with this. Personally I think the two-pizza-team, single-responsibility model is not a great idea. The most successful "microservices" setup I've worked on actually had 100-ish devs on the service. Enough to make on-call, upgrades, maintenance, etc. really spread out.
Agreed. This is why I prefer the term “service oriented architecture” instead. A service should be whatever size its domain requires - but the purpose is to encapsulate a domain. A personal litmus test I have for “is the service improperly encapsulating the domain” is if you need to handle distributed transactions. Sometimes they are necessary - but usually it’s an architectural smell.
The image concept, in my opinion, is what really limited Smalltalk's appeal and distribution.
The image meant you basically got whatever state the developer ended up with, frozen in time, with no indication really of how they got there.
Think of today's modern systems and open source, with so many libraries easily downloadable and able to be incorporated in your system in a very reproducible way. Smalltalk folks derided this as a low tech, lowest-common-denominator approach. But in fact it gave us reusable components from disparate vendors and sources.
The image concept was a huge strength of Smalltalk but, in the end, also one of the major things that held it back.
Java in particular surged right past Smalltalk despite many shortcomings compared to it, partly because of this. The other part, of course, was that Java was free at many levels; beyond the image issue, Smalltalk's other big problem was the cost of both developer licenses ($$$$!) and runtime licenses (ugh!).
> The image meant you basically got whatever state the developer ended up with, frozen in time, with no indication really of how they got there.
That wasn't a function of the image system. That was a product of your version control/CI/CD systems and your familiarity with them.
Consider that Docker and other container based systems also deploy images. No reason Smalltalk has to be any different.
I did software development work in Smalltalk in the 90's. We used version control (at one point, we used PVCS, which was horrible, but Envy was pretty sweet), had a build process and build servers that would build deploy images from vanilla images. Even without all that, the Smalltalk system kept a full change log of every single operation it performed, in order. In theory, someone could wipe their changelog, but that's the moral equivalent of deleting the source code for your binary. Image-based systems are no reason to abandon good engineering practices.
> Consider that Docker and other container based systems also deploy images.
Consider also that Docker was the only one to really get popular, perhaps because it promoted the idea of using a text-based "Dockerfile" as your source of truth and treating the images as transitory built artifacts (however false this was in practice).
It's still mostly true in practice. You don't add one more layer to your image to build the next version, you rebuild it from the Dockerfile, which is the opposite of the Smalltalk approach.
It's a practical workflow that was actually adopted.
Next you'll be wondering whether changes were integrated continuously throughout the day — "This fine granularity of components enables highly concurrent development because changes can be tracked down to the individual method level …"
I remember the early days of Docker, when it was not really about Dockerfiles yet: people would run an image, edit inside it, then commit that as a new image.
Arguably it goes back to chroot-stuff, and LXC predates Docker by some five years or so. I don't remember the details well but Solaris had similar containers, maybe even before LXC arrived.
I'd say the cloud popularised it outside of Linux and Unix sysadmin circles, rather than the Dockerfile format itself.
> Arguably it goes back to chroot-stuff, and LXC predates Docker by some five years or so. I don't remember the details well but Solaris had similar containers, maybe even before LXC arrived.
Solaris and FreeBSD had significantly better implementations of the containerisation/isolation piece from a technical standpoint. But they never caught on. I really think the Dockerfile made the difference.
> Solaris and FreeBSD had significantly better implementations of the containerisation/isolation piece from a technical standpoint.
FreeBSD jails (first) and Solaris zones (later) were 'heavier weight' than containers: folks perhaps did not want to manage a "light VM" to deploy applications.
I agree that the image concept was a problem, but I think that you're focused on the wrong detail.
The problem with an image based ecosystem that I see is that you are inevitably pushed towards using tools that live within that image. Now granted, those tools are able to be very powerful because they leverage and interact with the image itself. But the community contributing to that ecosystem is far smaller than the communities contributing to filesystem based tools.
The result is that people who are considering coming into the system have to start by abandoning their familiar toolchain. And for all of the technical advantages of the new toolchain, the much smaller contributor base creates a worse-is-better situation. While the file-based system has fundamental technical limitations, the size of the ecosystem results in faster overall development, and eventually a superior system.
> But the community contributing to that ecosystem is far smaller than the communities contributing to filesystem based tools.
Another point is that you need to export your tools out of your own image so others can import it into their images. This impedance mismatch between image and filesystem was annoying.
I think we could quibble over the relative importance of these points, but I agree in general. The image locking you into that ecosystem is definitely a good point.
> The image meant you basically got whatever state the developer ended up with, frozen in time, with no indication really of how they got there.
I worked with a similar language, Actor (Smalltalk with an Algol-like syntax), and the usual way to deal with distribution was to “pack” (IIRC) the image by pointing to the class that your app is an instance of, and the tool would remove every other object that is not a requirement of your app. With that you got an image that started directly into your app, without any trace of the development environment.
One of the reasons Java got adopted was that Smalltalk big names like IBM decided to go all in on Java.
It is no accident that Eclipse to this day still has a code navigation perspective based on Smalltalk, an incremental compiler that gives a Smalltalk-like experience, and a virtual filesystem in its workspaces that mimics the behaviour of Smalltalk images.
The entire philosophy of Smalltalk was to think of software artifacts as living entities. You can drop yourself into a piece of software, fully inspect everything, and engage with it by way of software archaeology. The aim was to do away with the distinction between interacting with, running, and writing software.
They wanted to get away from syntax and files, an inert recipe you have to rerun every time, so I think if you do away with the image you do away with the core aspect of it.
Computing in general just didn't go the direction they wanted; in many ways I think it was too ambitious an idea for its time. Personally I've always hoped it comes back.
The thing is that the "scripting" approach is just so much easier to distribute. Just look at how popular Python got. Smalltalk didn't understand that. And Smalltalk's syntax is worse than Python's IMO (and Ruby's too, of course).
A lot of great ideas are tried and tried and tried and eventually succeed, and what causes them to succeed is that someone finally creates an implementation that addresses the pragmatic and usability issues. Someone finally gets the details right.
Rust is a good example. We've had "safe" systems languages for a long time, but Rust was one of the first to address developer ergonomics well enough to catch on.
Another great example is HTTP and HTML. Hypertext systems existed before it, but none of them were flexible, deployable, open, interoperable, and simple enough to catch on.
IMHO we've never had a pure functional language that has taken off not because it's a terrible idea but because nobody's executed it well enough re: ergonomics and pragmatic concerns.
Typed out a response indicating the really good dev experience that F# and Elixir offer, but neither is "pure". Is Haskell the closest mainstream language to meet a purity requirement?
For this very reason I'm working on a development platform that makes all changes part of a cheaply stored CRDT log. The log is part of the application; there are some types of simulations inside it that we can only timestamp and replay, but we can always derive the starting position with 100% accuracy.
>> most impressive part of Smalltalk ecosystem is the structure of the image
> The image concept, in my opinion, is what really limited Smalltalk's appeal and distribution.
I'd say these statements are both true. The image concept is very impressive and can be very useful; it certainly achieved a lot of bang for very little buck.
And it also was/is one of the major impediments for Smalltalk, at least after the mid 1980s.
The impressive bit is shown by pretty much the entire industry slowly and painfully recreating the Smalltalk image, just usually worse.
For example on macOS a lot of applications nowadays auto-save their state and will completely return to the state they were last in. So much that nowadays if you have a lot of TextEdit windows open and wish to make sure everything is safe, you kill the program, you don't quit it.
Also, all/most of the shared libraries and frameworks that come with the system are not loaded individually; instead they are combined into one huge image file that is mapped into your process. At some point they stopped shipping the individual framework and shared library binaries.
User interfaces have also trended in the direction of an application that contains its own little world, rather than editing files that exist within the wider Unix filesystem.
The image accomplished all that and more, and did so very efficiently, both in execution speed and in the amount of mechanism required: have a contiguous piece of memory, write it to disk, make a note of the start pointer. On load, map or read it into memory, fix up the pointers if you didn't manage to load at the same address, and you're ready to go. On G4/G5-era Macs, the latter would take maybe a second or two, whereas Pages, for example, took forever to load if things weren't already cached, despite having much less total data to load.
But the drawbacks are also huge. You're really in your little world and going outside of it is painful. On an Alto in the mid to late 1970s I imagine that wasn't much of an issue, because there wasn't really much outside world to connect to, computer-wise, and where would you fit it on a 128KB machine (including the bitmap display)? But nowadays the disadvantages far outweigh the advantages.
With Objective-S, I am building on top of Cocoa's Bundle concept, so special directories that can contain executable code, data or both. Being directories, bundles can nest. You can treat a bundle as data that your program (possibly the IDE) can edit. But you can also plonk the same bundle in the Resources folder of an application to have it become part of that application. In fact, the IDE contains an operation to just turn the current bundle into an application, by copying a generic wrapper application from its own resources and then placing the current bundle into that freshly created/copied app.
Being directories, data resources in bundles can remain standard files, etc.
With Objective-S being either interpreted or compiled, a bundle with executable code can just contain the source code, which the interpreter will load and execute. Compiling the code inside a bundle to binaries is just an optimization step, the artifact is still a bundle. Removing source code of a bundle that has an executable binary is just an obfuscation/minimization step, the bundle is still the bundle.
The key is the plural in "Docker containers". You're not doing everything by modifying one Docker container that's been handed down over literally generations, you're rebuilding images as you need to, usually starting from a golden master, but sometimes starting from a scratch image into which you just copy individual files. It's the "cattle, not pets" mentality, whereas a Smalltalk or Lisp Machine image is the ultimate pet.
> You're not doing everything by modifying one Docker container that's been handed down over literally generations
You don't do that with Smalltalk, either, at least not for the last 30 years or so. Smalltalk has worked with version control systems for decades to maintain the code outside the image and collaborate with others without needing to share images.
I try not to think about these things, I've mostly worked with hardware-centric companies and on "legacy" systems. So many things they're doing that no one else does because 5-25 years ago everyone else figured out the lessons from 30-60 years ago, except for these companies.
> you're rebuilding images as you need to, usually starting from a golden master
For example, cp the "golden master" into the current directory and rename it "nbody.pharo_run.image".
"fileIn" the source code file (name passed on the commandline) "nbody.pharo" (and then cleanUp and garbageCollect) and "snapshot" the image.
Then invoke the program "main.st". In this example, the source code file defined a class method BenchmarksGame>>do: which performs a calculation and prints the result on stdout.
Except that's not really what happened. You're ignoring the range of in-image tools which kept track of who did what, where. From versioning of individual methods, to full-blown distributed version control systems, which predated git.
Not to sound harsh or gatekeep, but folks who keep repeating the canard that "the Smalltalk image resulted from the developer just banging on the system" mostly never used Smalltalk in the first place.
Give the original smalltalk devs some credit for knowing how to track code development over time.
No, I haven't ignored those tools. They were all stop-gaps that worked in a "meh" way to various degrees. Smalltalk was always optimized for one guy banging away on their solution. Add a second developer and things got much hairier, and more so as you kept adding them.
Hmm, well I don't know exactly when Monticello was first developed, but it was certainly in heavy use by the early 2000s. How is that "meh" when compared to ... cvs & subversion?
I don't know much about the systems used in commercial smalltalks of the 90s, but I'm sure they weren't "meh" either (others more knowledgeable than me about them can chime in).
Image-centric development is seductive (I'm guilty). But the main issue isn't "we don't know what code got put where, and by whom". There were sophisticated tools available almost from the get-go for that.
It's more a problem of dependencies not being pruned, because someone, somewhere wants to use them. So lots of stuff remained in the "blessed" image (I'm only referring to Squeak here) which really ought not to have been in the standard distribution. And because it was there, some other unrelated project further down the line used a class here, a class there.
So when you later realise it needed to be pruned, it wasn't that easy.
But nevertheless, it was still done. Witness Cuis.
In other words, it was a cultural problem, not a tooling problem. It's not that Squeak had too few ways of persisting & distributing code - it had too many.
IMHO, the main problem was never the image, or lack of tools. It was lack of modularisation. All classes existed in the same global namespace. A clean implementation of modules early on would have been nice.
Interesting. Shows how aware they were of these 2025 criticisms, way back in the 80s (which shows how much of an oversimplification these criticisms are of the real situation).
You probably already know about this, but in case you didn't, there is one project which adds modules to Cuis Smalltalk.
Digitalk’s Team/V unobtrusively introduced a non-reflective syntax and versioned source code using RCS. Team/V could migrate versions of Smalltalk “modules” forwards and backwards within a running virtual image.
"When you use a browser to access a method, the system has to retrieve the source code for that method. Initially all the source code is found in the file we refer to as the sources file. … As you are evaluating expressions or making changes to class descriptions, your actions are logged onto an external file that we refer to as the changes file. If you change a method, the new source code is stored on the changes file, not back into the sources file. Thus the sources file is treated as shared and immutable; a private changes file must exist for each user."
1984 "Smalltalk-80 The Interactive Programming Environment" page 458
But the image isn’t just the code, or classes, it’s also the network of objects (instances). And that’s more difficult to version, or to merge branches of.
Given that the instantiation of those objects has been triggered by Smalltalk commands, those commands can be recorded and versioned and replayed to instantiate those objects.
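A tiny sketch of that record-and-replay idea (hypothetical Python, standing in for a Smalltalk change log; the command format is invented):

```python
# Hypothetical sketch: rebuild object state by replaying a versioned command
# log, rather than versioning the object graph itself.
commands = [
    ("create", "account_1", {"balance": 0}),
    ("update", "account_1", {"balance": 100}),
    ("create", "account_2", {"balance": 50}),
]

def replay(log):
    objects = {}
    for op, obj_id, fields in log:
        if op == "create":
            objects[obj_id] = dict(fields)
        elif op == "update":
            objects[obj_id].update(fields)
    return objects

# Any prefix of the log reproduces the exact state at that point in history.
state_at_v2 = replay(commands[:2])
```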
It means that versioning operations, even just displaying the history, effectively have to run the full image from the beginning of the history, or take intermediate snapshots of the image. In addition, there is interaction between the source code changes and the recorded command history. It also doesn't address how merging would be practical. You would have to compare the state of two images side-by-side, or rather three, for three-way merges.
"Within each project, a set of changes you make to class descriptions is maintained. … Using a browser view of this set of changes, you can find out what you have been doing. Also, you can use the set of changes to create an external file containing descriptions of the modifications you have made to the system so that you can share your work with other users."
1984 "Smalltalk-80 The Interactive Programming Environment" page 46
~
The image is throw-away. It's a cache, not an archive.
"At the outset of a project involving two or more programmers: Do assign a member of the team to be the version manager. … The responsibilities of the version manager consist of collecting and cataloging code files submitted by all members of the team, periodically building a new system image incorporating all submitted code files, and releasing the image for use by the team. The version manager stores the current release and all code files for that release in a central place, allowing team members read access, and disallowing write access for anyone except the version manager."
1984 "Smalltalk-80 The Interactive Programming Environment" page 500
Having wrangled many spreadsheets personally, and worked with CFOs who use them to run small-ish businesses, all the way up to one of the top 3 brokerage houses worldwide using them to model complex fixed income instruments... this is a disaster waiting to happen.
Spreadsheet UI is already a nightmare. The formula editing and relationship visualization are not there at all. Mistakes are rampant in spreadsheets, even my own carefully curated ones.
Claude is not going to improve this. It is going to make it far, far worse with subtle and not so subtle hallucinations happening left and right.
The key is really this - all LLMs that I know of rely on entropy and randomness to emulate human creativity. This works pretty well for pretty pictures and creating fan fiction or emulating someone's voice.
It is not a basis for getting correct spreadsheets that show what you want to show. I don't want my spreadsheet correctness to start from a random seed. I want it to spring from first principles.
My first job out of uni was building a spreadsheet infra-as-code version control system, after a Windows update made an eight-year-old spreadsheet go haywire and lose $10m in an afternoon.
Yeah, that's what OP said. Now add a bunch of random hallucinations hidden inside formulas inside cells.
If they really have a good spreadsheet solution they've either fixed the spreadsheet UI issues or the LLM hallucination issues or both. My guess is neither.
Compared to what? Granted, Excel incidents are probably underreported and might produce "silent" consequential losses. But compared to that, for enterprise or custom software in general we have pretty scary estimates of the damages. Like Y2K (between 300-600bn) and the UK Post Office thing (~1bn).
Excel spreadsheets ARE custom software, with custom requirements, calculations, and algorithms. They're just not typically written by programmers, have no version control or rollback abilities, are not audited, are not debuggable, and are typically not run through QA or QC.
I'll add to this - if you work on a software project to port an Excel spreadsheet to real software that has all those properties, and the spreadsheet is sophisticated enough to warrant the process, the creators won't be able to remember enough details about how they created it to tell you the requirements necessary to produce the software. You may do all the calculations right, and because they've always had a rounding error that they've worked around somewhere else, your software shows that calculations which have driven business decisions for decades were always wrong, and the business will insist that the new software is wrong instead of owning the mistake. It's never pretty, and it always governs something extremely important.
Now, if we could give that Excel file to an LLM and it creates a design document that explains everything it does, then that would be a great use of an LLM.
Thing is, they are also the common workaround for savvy office workers who don't want to wait for the IT department (if it exists), or some outsourced consultancy, to finally deliver something that only does half the job they need.
So far no one has managed to deliver an alternative to spreadsheets that fixes this issue; it doesn't matter if we can do much better in Python, Java, C#, whatever, if it is always over budget and only covers half of the work.
I know, I have taken part in such a project, and it ran over budget because there was always that little workflow super easy to do in Excel, and they would refuse to adopt the tool if it didn't cover that workflow as well.
exactly. And Claude and other code assistants are more of the same, allowing non-programmers[1] to write code for their needs. And that's a good thing overall.
[1] well, people that don't consider themselves programmers.
Agreed. The tradition has been continued by workflow engines, low code tools, platforms like Salesforce and lately AI-builders. The issue is generally not that these are bad, but because they don't _feel_ like software development everyone is comfortable skipping steps of the development process.
To be fair, I've seen shops which actually apply good engineering practices to Excel sheets too. Just definitely not a majority...
Sometimes it isn't that folks are comfortable skipping steps; rather, the steps aren't even available.
As so happens in the LLM age, I have been recently having to deal with such tools, and oh boy Smalltalk based image development in the 1990's with Smalltalk/V is so much better in regards to engineering practices than those "modern" tools.
I cannot test code, if I want to backup to some version control system, I have to manually export/import a gigantic JSON file that represents the low-code workflow logic, no proper debugging tools, and so many other things I could rant about.
But I guess this is the future, AI agents based workflow engines calling into SaaS products, deployed in a MACH architecture. Great buzzword bingo, right?
In my opinion the biggest use case for spreadsheets with LLMs is to ask them to build Python scripts to do whatever manipulations you want to do with the data. Once people learn to do this, workplace productivity would increase greatly. I have been using LLMs for years now to write Python scripts that automate different repeatable tasks. Want a PDF of this data overlaid on this file? Create a Python script with an LLM. Want the data exported out of this to be formatted and tallied? Create a script for that.
How will people without Python knowledge know that the script is 100% correct? You can say "Well they shouldn't use it for mission critical stuff" or "Yeah that's not a use case, it could be useful for qualitative analysis" etc., but you bet they will use it for everything. People use ChatGPT as a search engine and a therapist, which tells us enough
Yesterday I had to pass a bunch of data to finance, as the person that usually does so had left the company. They wanted me to basically group by a few columns, so instead of spending an hour on this in Excel, I created 3 rows of fake data, gave it to the LLM, and it created a Python script which I ran against the dataset. After manual verification of the results, it could be submitted to finance.
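The kind of script that comes back is usually only a few lines. A hypothetical sketch of the group-by case (file and column names are invented):

```python
# Hypothetical sketch of the kind of script an LLM produces for "group by a
# few columns and total the amounts"; file and column names are made up.
import pandas as pd

df = pd.read_csv("export_for_finance.csv")
summary = (
    df.groupby(["cost_center", "month"], as_index=False)["amount"]
      .sum()
      .sort_values(["cost_center", "month"])
)
summary.to_csv("summary_for_finance.csv", index=False)
```

The point is that the output stays a plain, reviewable script you run yourself, rather than an opaque edit to the workbook.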
Yeah, I am not a programmer, just more tech literate than most, as I have always been fascinated by tech. I think people are missing the forest for the trees when it comes to LLMs. I have been using them to create simple bash, bat, and Python scripts, which I would not have been able to put together before even with weeks of googling. I say that because I used to try that unsuccessfully, but my success rate is through the roof with LLMs.
Now I just ask an LLM to create the scripts and explain all the steps. If it is a complex script I also ask it to add logging, so that I can feed the log back to the LLM and explain what is going wrong, which allows for much faster fixes. In the early days the LLM and I would go around in circles until I hit the token limits and had to start from scratch again.
I don't think tools like Claude are there yet, but I already trust GPT-5 Pro to be more diligent about catching bugs in software than me, even when I am trying to be very careful. I expect even just using these tools to help review existing Excel spreadsheets could lead to a significant boost in quality if software is any guide (and Excel spreadsheets seem even worse than software when it comes to errors).
That said, Claude is still quite behind GPT-5 in its ability to review code, and so I'm not sure how much to expect from Sonnet 4.5 in this new domain. OpenAI could probably do better.
> That said, Claude is still quite behind GPT-5 in its ability to review code, and so I'm not sure how much to expect from Sonnet 4.5 in this new domain. OpenAI could probably do better.
It’s always interesting to see others opinions as it’s still so variable and “vibe” based. Personally, for my use, the idea that any GPT-5 model is superior to Claude just doesn’t resonate - and I use both regularly for similar tasks.
I also find the subjective nature of these models interesting, but in this case the difference in my experiences between Sonnet 4.5 and GPT-5 Codex, and especially GPT-5 Pro, for code review is pretty stark. GPT-5 is consistently much better at hard logic problems, which code review often involves.
I have had GPT-5 point out dozens of complex bugs to me. Often in these cases I will try to see if other models can spot the same problems, and Gemini has occasionally but the Claude models never have (using Opus 4, 4.1, and Sonnet 4.5). These are bugs like complex race conditions or deadlocks that involve complex interactions between different parts of the codebase. GPT-5 and Gemini can spot these types of bugs with a decent accuracy, while I’ve never had Claude point out a bug like this.
If you haven’t tried it, I would try the codex /review feature and compare its results to asking Sonnet to do a review. For me, the difference is very clear for code review. For actual coding tasks, both models are much more varied, but for code review I’ve never had an instance where Claude pointed out a serious bug that GPT-5 missed. And I use these tools for code review all the time.
I've noticed something similar. I've been working on some concurrency libraries for elixir and Claude constantly gets things wrong, but GPT5 can recognize the techniques I'm using and the tradeoffs.
Try the TypeScript codex CLI with the gpt-5-codex model with reasoning always set to high, or GPT-5 Pro with max reasoning. Both are currently undeniably better than Claude Opus 4.1 or Sonnet 4.5 (max reasoning or otherwise) for all code-related tasks. Much slower but more reliable and more intelligent.
I've been a Claude Code fanboy for many months but OpenAI simply won this leg of the race, for now.
Same. I switched from sonnet 4 when it was out to codex. Went back to try sonnet 4.5 and it really hates to work for longer than like 5 minutes at a time
Codex meanwhile seems to be smarter and plugs away at a massive todo list for like 2 hours
Yeah, it's like that commercial for OpenAI (or was it Gemini?) where the guy says he lets the tool work on his complex financial spreadsheets, goes for a walk with the dog, gets back and it is done with "like 98% accuracy". I cannot imagine what the 2% margin of error looks like for a company that moves around hundreds of billions of dollars...
Having AI create the spreadsheet you want is totally possible, just like generating bash scripts works well. But to get good results, there needs to be some documentation describing all the hidden relationships and nasty workarounds first.
Don't try to make LLMs generate results or numbers, that's bound to fail in any case. But they're okay to generate a starting point for automations (like Excel sheets with lots of formulas and macros), given they get access to the same context we have in our heads.
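One way to keep that "starting point" auditable is to have the model emit a script that writes formulas rather than numbers. A hypothetical sketch with openpyxl (the sheet layout and headers are invented):

```python
# Hypothetical sketch: generate a sheet where the results are formulas the
# user can inspect, not opaque numbers; headers and layout are made up.
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["Item", "Unit price", "Qty", "Total"])
for row, (item, price, qty) in enumerate(
        [("Widget", 9.5, 4), ("Gadget", 3.2, 10)], start=2):
    ws.append([item, price, qty, f"=B{row}*C{row}"])
ws[f"D{row + 1}"] = f"=SUM(D2:D{row})"
wb.save("starting_point.xlsx")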
I like this take. There seems to be an over-focus on 'one-shot' results, but I've found that even the free tools are a significant productivity booster when you focus on generating smaller pieces of code that you can verify. Maybe I'm behind the power curve since I'm not leveraging the full capability of the advanced LLM's, but if the argument is disaster is right around the corner due to potential hallucinations, I think we should consider that you still have to check your work for mission critical systems. That said, I don't really build mission critical systems - I just work in Aerospace Engineering and like building small time saving scripts / macros for other engineers to use. For this use, free LLMs even have been huge for me. Maybe I'm in a very small minority, but I do use Excel & Python nearly every day.
I tend to agree that dropping the tool as it is into untrained hands is going to be catastrophic.
I’ve had similar professional experiences as you and have been experimenting with Claude Code. I’ve found I really need to know what I’m doing and the detail in order to make effective (safe) use out of it. And that’s been a learning curve.
The one area I hope/think it’s closest to (given comments above) is potentially as a “checker” or validator.
But even then I’d consider the extent to which it leaks data, steers me the wrong way, or misses something.
The other case may be mocking up a simple financial model for a test / to bounce ideas around. But without very detailed manual review (as a mitigating check), I wouldn’t trust it.
So yeah… that’s the experience of someone who maybe bridges these worlds somewhat… And I think many out there see the tough (detailed) road ahead, while these companies are racing to monetize.
You would hope so. But how many companies have actually changed their IT policy of outsourcing everything to Tata Consultancy Services (or similar) where a sweaty office in Mumbai full of people who don't give a shit run critical infrastructure?
Jaguar Land Rover had production stopped for over a month, I think, with 100+ million in impact to their business (including a trail of smaller suppliers put near bankruptcy). I'd bet Tata are still there and embedded even further in 5 years.
If AI provides some day-to-day running cost reduction that looks good on quarterly financial statements it will be fully embraced, despite the odd "act of god".
Indeed, that slipped my mind. However the Marks and Spencer hack was also their fault. Just searching on it now, it seems there is a ray of hope. Although I have a feeling the response won't be a well-trained onshore/internal IT department. It will be another offshore outsourcing jaunt, but with better compensation for incompetent staff on the outsourcer's side.
"Marks & Spencer Cuts Ties With Tata Consultancy Services Amid £300m Cyber Attack Fallout" (ibtimes.co.uk)
My take is more optimistic. This could be an off ramp to stop putting critical business workflows in spreadsheets. If people start to learn that general purpose programming languages are actually easier than Excel (and with LLMs, there is no barrier), then maybe more robust workflows and automation will be the norm.
I think the world would be a lot better off if Excel weren't in it. For example, I work at a business with 50K+ employees where project management is done in a hellish spreadsheet literally one guy in Australia understands. Data entry errors can be anywhere and are incomprehensible. 3 or 4 versions are floating around to support old projects. A CRUD app with a web front end would solve it all. Yet it persists because Excel is erroneously seen as accessible whereas Rails, Django, or literally anything else is witchcraft.
> all LLMs that I know of rely on entropy and randomness to emulate human creativity
Those are tuneable parameters. Turn down the temperature and top_p if you don't want the creativity.
> Claude is not going to improve this.
We can measure models vs humans and figure this out.
To your own point, humans already make "rampant" mistakes. With models, we can scale inference time compute to catch and eliminate mistakes, for example: run 6x independent validators using different methodologies.
One-shot financial models are a bad idea, but properly designed systems can probably match or beat humans pretty quickly.
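A rough sketch of that "independent validators" idea (the ask_model helper and the check prompts are hypothetical, not any particular product's API; the original comment suggests six checks, three are shown for brevity):

```python
# Hypothetical sketch: accept a generated formula only if independent
# validation passes agree. ask_model() is a placeholder for any LLM call.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real LLM call")

CHECKS = [
    "Recompute the result from the raw numbers and report PASS or FAIL.",
    "Check that every cell reference in the formula exists and report PASS or FAIL.",
    "Check that units and signs are consistent and report PASS or FAIL.",
]

def validate(candidate_formula: str, required: int = 3) -> bool:
    passes = 0
    for check in CHECKS:
        verdict = ask_model(f"{check}\n\nFormula: {candidate_formula}")
        passes += verdict.strip().upper().startswith("PASS")
    return passes >= required
```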
That's at training time, not inference time. And temp/top_p aren't used to escape local minima; methods like SGD batch sampling, Adam, dropout, LR decay, and other techniques do that.
You can zero out temperature and get determinism at inference time. Which is separate from training time where you need forms of randomness to learn.
The point is that the quoted claim, "all LLMs that I know of rely on entropy and randomness to emulate human creativity", describes a runtime parameter you can tweak down to zero, not a fundamental property of the technology.
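Concretely, with an OpenAI-style chat completions client those are just request parameters (the model name and prompt here are purely illustrative):

```python
# Illustrative only: with an OpenAI-style chat API, sampling "creativity" is a
# request parameter, not an inherent property. temperature=0 makes decoding greedy.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # model name is just an example
    messages=[{"role": "user", "content": "Sum column B of this table..."}],
    temperature=0,        # no sampling randomness
    top_p=1,              # consider the full distribution
)
print(resp.choices[0].message.content)
```

Deterministic sampling is not the same thing as a correct answer, which is the point the reply below makes.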
Right, but my point is is that even if you turn the temperature all the way down, you're not guaranteed to get an accurate or truthful result even though you may get a mostly repeatable deterministic result, and there is still some indeterminacy.
Not the parent poster, but this is pretty much the foundation of LLMs. They are by their nature probabilistic, not deterministic. This is precisely what the parent is referring to.
All processes in reality, everywhere, are probabilistic. The entire reason "engineering" is not the same as theoretical mathematics is about managing these probabilities to an acceptable level for the task you're trying to perform. You are getting a "probabilistic" output from a human too. Human beings are not guaranteeing theoretically optimal Excel output when they send their boss Final_Final_v2.xlsx. You are using your mental model of their capabilities to inform how much you trust the result.
Building a process to get a similar confidence in LLM output is part of the game.
I have to disagree. There are many areas where things are extremely deterministic, regulated financial services being one of those areas. As one example of zillions, look at something like bond math. All of it is very well defined, all the way down to what calendar model you will use (30/360 or what have you), rounding, etc. It's all extremely well defined specifically so you can get apples-to-apples comparisons in the market place.
The same applies to my checkbook, and many other areas of either calculating actuals or where future state is well defined by a model.
That said, there can be a statistical aspect to any spreadsheet model. Obviously. But not all spreadsheets are statistical, and therein lies the rub. If an LLM wants to hallucinate a 9,000-day year because it confuses our notion of a year with one of the outer planets, that falls well within probability, but not within determinism following well-defined rules.
The other side of the issue is LLMs trained on the Internet. What are the chances that Claude or whatever is going to make a change based on a widely prevalent but incorrect spreadsheet it found on some random corner of the Internet? Do I want Claude breaking my well-honed spreadsheet because Floyd in Nebraska counted sheep wrong in a spreadsheet he uploaded and forgot about 5 years ago, and Claude found it relevant?
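To make the "bond math is well defined" point concrete, the plain US 30/360 day count mentioned above boils down to a few lines. A simplified sketch (end-of-February adjustments omitted):

```python
# Simplified US 30/360 day-count convention: every month is treated as 30
# days and the year as 360. (End-of-February special cases are omitted.)
import datetime

def days_30_360(start: datetime.date, end: datetime.date) -> int:
    d1 = min(start.day, 30)
    d2 = 30 if (end.day == 31 and d1 == 30) else end.day
    return 360 * (end.year - start.year) + 30 * (end.month - start.month) + (d2 - d1)

def year_fraction_30_360(start: datetime.date, end: datetime.date) -> float:
    return days_30_360(start, end) / 360.0

# Exactly one "year" regardless of the actual number of calendar days:
assert year_fraction_30_360(datetime.date(2024, 1, 31), datetime.date(2025, 1, 31)) == 1.0
```

There is no room for a model to be "creative" here; the convention is the spec.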
Yup. It becomes clearer to me when I think about the existing validators. Can these be improved, for sure.
It’s when people make the leaps to the multi-year endgame and in their effort to monetise by building overconfidence in the product where I see the inherent conflict.
It’s going to be a slog… the detailed implementations. And if anyone is a bit more realistic about managing expectations I think Anthropic is doing it a little better.
> All processes in reality, everywhere, are probabilistic.
If we want to go in philosophy then sure, you're correct, but this not what we're saying.
For example, an LLM is capable (and it's highly plausible for it to do so) of creating a reference to a non-existent source. Humans generally don't do that when their goal is clear and aligned (hence deterministic).
> Building a process to get a similar confidence in LLM output is part of the game.
Which is precisely my point. LLMs are supposed to be better than humans. We're (currently) shoehorning the technology.
> Humans generally don't do that when their goal is clear and aligned (hence deterministic).
Look at the language you're using here. Humans "generally" make less of these kinds of errors. "Generally". That is literally an assessment of likelihood. It is completely possible for me to hire someone so stupid that they create a reference to a non-existent source. It's completely possible for my high IQ genius employee who is correct 99.99% of the time to have an off-day and accidentally fat finger something. It happens. Perhaps it happens at 1/100th of the rate that an LLM would do it. But that is simply an input to the model of the process or system I'm trying to build that I need to account for.
Not OP but using LLMs in any professional setting, like programming, editing or writing technical specifications, OP is correct.
Without extensive prompting and injecting my own knowledge and experience, LLMs generate absolutely unusable garbage (on average). Anyone who disagrees very likely is not someone who would produce good quality work by themselves (on average). That's not a clever quip; that's a very sad reality. SO MANY people cannot be bothered to learn anything if they can help it.
The triad of LLM dependencies, in my view: initiation of tasks, experience-based feedback, and a consequence sink. They can do none of these; they all connect to the outer context, which sits with the user, not the model.
You know what? This is also not unlike hiring a human: they need the hiring party to tell them what to do, give feedback, and assume the outcomes.
It's all about context, which is non-fungible and distributed - not related to intelligence but to what we need intelligence for.
> Anyone who disagrees very likely is not someone who would produce good quality work by themselves (on average).
So for those producing slop and not knowing any better (or not caring), AI just improved the speed at which they work! Sounds like a great investment for them!
For many mastering any given craft might not be the goal, but rather just pushing stuff out the door and paying bills. A case of mismatched incentives, one might say.
I would completely disagree. I use LLMs daily for coding. They are quite far from AGI and it does not appear they are replacing Senior or Staff Engineers any time soon. But they are incredible machines that are perfectly capable of performing some economically valuable tasks in a fraction of the time it would have taken a human. If you deny this your head is in the sand.
Capable, yeah, but not reliable, that's my point. They can one shot fantastic code, or they can one shot the code I then have to review and pull my hair out over for a week, because it's such crap (and the person who pushed it is my boss, for example, so I can't just tell him to try again).
It's not much better with planning either. The amount of time I spent planning, clarifying requirements, hand-holding implementation details always offset any potential savings.
Actually, yeah, a couple of times, but that was a rubber-ducky approach; the AI said something utterly stupid, but while trying to explain things, I figured it out. I don't think an LLM has solved any difficult problem for me before. However, I think I'm likely an outlier because I do solve most issues myself anyways.
To me, the case for LLMs is strongest not because LLMs are so unusually accurate and awesome, but because if human performance were put on trial in aggregate, it would be found wanting.
Humans already do a mediocre job of spreadsheets, so I don't think it is a given that Claude will make more mistakes than humans do.
But isn't this only fine as long someone who knows what they are doing has oversight and can fix issues when they arise and Claude gets stuck?
Once we all forget how to write SUM(A:A), will we just invent a new kind of spreadsheet once Claude gets stuck?
Or in other words; what's the end game here? LLMs clearly cannot be left alone to do anything properly, so what's the end game of making people not learn anything anymore?
Well the end game with AI is AGI, of course. But realistically the best case scenario with LLMs is having fewer people with the required knowledge, leveraging LLMs to massively enhance productivity.
We’re already there to some degree. It is hard to put a number on my productivity gain, but as a small business owner with a growing software company it’s clear to me already that I can reduce developer hiring going forward.
When I read the skeptics I just have to conclude that they’re either poor at context building and/or work on messy, inconsistent and poorly documented projects.
My sense is that many weaker developers who can’t learn these tools simply won't compete in the new environment. Those who can build well designed and documented projects with deep context easy for LLMs to digest will thrive.
Why isn't there a single study that would back up your observations?
The only study with a representative experimental design that I know about is the METR study and it showed the opposite. Every study citing significant productivity improvements that I've seen is either:
- relying on self-assessments from developers about how much time they think they saved, or
- using useless metrics like lines of code produced or PRs opened, or
- timing developers on toy programming assignments like implementing a basic HTTP server that aren't representative of the real world.
Why is it that any time I ask people to provide examples of high quality software projects that were predominantly LLM-generated (with video evidence to document the process and allow us to judge the velocity), nobody ever answers the call? Would you like to change that?
My sense is that weaker developers and especially weaker leaders are easily impressed and fascinated by substandard results :)
Everything Claude does is reviewed by me, nothing enters the code base that doesn’t meet the standard we’ve always kept. Perhaps I’m sub standard and weak but my software is stable, my customers are happy, and I’m delivering value to them quicker than I was previously.
I don’t know how you could effectively study such a thing, that avenue seems like a dead end. The truth will become obvious in time.
Okay, and now you give those mediocre humans a tool that is both great and terrible. The problem is, unless you know your way around very well, they won't know which is which.
Since my company uses Excel a lot, and I know the basics but don't want to become an expert, I use LLMs to ask intermediate questions, too hard to answer with the few formulas I know, not too hard for a short solution path.
I have great success and definitely like what I can get with the Excel/LLM combo. But if my colleagues used it the same way, they would not get my good results, which is not their fault; they are not IT people but specialists, e.g. for logistics. The best use of LLMs is if you could already do the job without them, but it saves you time to ask them and then check whether the result is actually acceptable.
Sometimes I abandon the LLM session, because sometimes, and it's not always easy to predict, fixing the broken result would take more effort than just doing it the old way myself.
A big problem is that the LLMs are so darn confident and always present a result. For example, I point one at a problem, it "thinks", and then it gives me new code and very confidently summarizes what the problem was and asserts that it has now fixed it for sure. Only when I actually try it, the result has gotten worse than before. At that point I never try to get back to a working solution by continuing to "talk" to the AI; I just delete that session and take another, non-AI approach.
But non-experts, and people who are very busy and just want to get some result to forward to someone waiting for it as quickly as possible will be tempted to accept the nice looking and confidently presented "solution" as-is. And you may not find a problem until half a year later somebody finds that prepayments, pro forma bills and the final invoices don't quite match in hard to follow ways.
Not that these things don't happen now already, but adding a tool with erratic results might increase problems, depending on actual implementation of the process. Which most likely won't be well thought out, many will just cram in the new tool and think it works when it doesn't implode right away, and the first results, produced when people still pay a lot of attention and are careful, all look good.
I am in awe of the accomplishments of this new tool, but it is way overhyped IMHO, still far too unpolished and random. Forcing all kinds of processes and people to use it is not a good match, I think.
This is a great point. LLMs make good developers better, but they make bad developers even worse. LLMs multiply instead of add value. So if you're a good developer, who is careful, pays attention, watches out for trouble, and is constantly reviewing and steering, the LLM is multiplying by a positive number and will make you better. However, if you're a mediocre/bad developer, who is not careful, who lacks attention to detail, and just barely gets things to compile / run, then the LLM is multiplying by a negative number and will make your output even worse.
Or you could, you know, read the article before commenting to see the limited scope of this integration?
Anyway, Google has already integrated Gemini into Sheets, and recently added direct spreadsheet editing capability, so your comment was disproven before you even wrote it.
> The key is really this - all LLMs that I know of rely on entropy and randomness to emulate human creativity. This works pretty well for pretty pictures and creating fan fiction or emulating someone's voice.
I think you need to turn down the temperature a little bit. This could be a beneficial change.
Random condensation is a great way to put it. This is exactly what I see particularly in email and text summaries, they do not capture the gist of the message but instead just pull out random phrases that 99.9% of the time are not the gist at all. I have learned to completely ignore them.
I’m not sure what you mean by Azure being more painful for FOSS stacks. That is not my experience. Could you elaborate?
However I have seen many people flee from GCP because: Google lacks customer focus, Google is free about killing services, Google seems to not care about external users, people plain don’t trust Google with their code, data or reputation.
Trying to be as kind as possible in my interpretation of the article, my take was that the author got stuck on the "spherical cow" analogy early on and couldn't let it go. I think there are nuggets of good ideas here, which generally try to speak to leaky abstractions and impedance mismatches between hardware and software in general, but the author was stuck in spherical cow mode and the words all warped toward that flawed analogy.
This is a great example of why rewrites are often important, in both English essays and blogs as well as in software development. Don't get wedded to an idea too early, and if evidence starts piling up that you're going down a bad path, be fearless and don't be afraid of a partial or even total rewrite from the ground up.
Yes. And the two nuggets I took were looking at a Unix pipe as a concurrent processing notation, and pointing out that Unix R&D into great notations (or the communication thereof?) stopped right before splitting, cloning and merging concurrent streams. I've rarely seen scripts nicely setting up a DAG of named pipes, and I'm not aware of a standard Unix tool that would organize a larger such DAG and make it maintainable and easy to debug.
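For what it's worth, a small DAG can be wired up by hand. A hypothetical Python sketch that fans one stream out to two filters and merges them back (the log file name and grep patterns are made up):

```python
# Hypothetical sketch: producer -> tee -> {grep ERROR, grep WARN} -> sort,
# wired together with two named pipes. File name and patterns are made up.
import os
import subprocess
import tempfile

workdir = tempfile.mkdtemp()
pipe_a = os.path.join(workdir, "a.fifo")
pipe_b = os.path.join(workdir, "b.fifo")
os.mkfifo(pipe_a)
os.mkfifo(pipe_b)

# Start the sinks first so both FIFOs have readers before anyone writes.
merge = subprocess.Popen(["sort"], stdin=subprocess.PIPE)
branch_a = subprocess.Popen(["grep", "ERROR", pipe_a], stdout=merge.stdin)
branch_b = subprocess.Popen(["grep", "WARN", pipe_b], stdout=merge.stdin)

# tee copies the producer's stream into pipe_a while its stdout feeds pipe_b.
producer = subprocess.Popen(["cat", "app.log"], stdout=subprocess.PIPE)
pipe_b_writer = open(pipe_b, "w")
splitter = subprocess.Popen(["tee", pipe_a], stdin=producer.stdout,
                            stdout=pipe_b_writer)

producer.stdout.close()   # only the splitter holds the producer's output now
splitter.wait()
pipe_b_writer.close()     # signal EOF to the WARN branch
branch_a.wait()
branch_b.wait()
merge.stdin.close()       # signal EOF to sort, which then prints the merged result
merge.wait()
```

It works, but having to reason about open/close ordering by hand is exactly the kind of thing a proper DAG tool would hide.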
Object Relational Mapping (ORM) tools, which focus on mapping between code based objects and SQL tables, often suffer from what is called the N+1 problem.
A naive ORM setup will often end up doing 1 query to get the list of objects it needs, and then performing N queries, one per object, usually fetching each object individually by ID or key.
So for example, if you wanted to see “all TVs by Samsung” on a consumer site, it would do 1 query to figure out the set of items that match, and then, if say 200 items matched, it would do 200 queries to get those individual items.
ORMs are better at avoiding it these days, depending on the ORM or language, but it still can happen.
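To make the shape of the problem concrete, here it is in plain SQL issued from Python, with an invented products table (sqlite3 stands in for whatever database the ORM talks to), first the N+1 version and then the single-query fix:

```python
# N+1 illustration with an invented "products" table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, brand TEXT, name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(i, "Samsung", f"TV {i}") for i in range(1, 201)])

# Naive ORM behaviour: one query for the ids, then one query per object.
ids = [row[0] for row in
       conn.execute("SELECT id FROM products WHERE brand = ?", ("Samsung",))]
tvs_slow = [conn.execute("SELECT * FROM products WHERE id = ?", (i,)).fetchone()
            for i in ids]          # 200 extra round trips

# What you actually want: one query fetching the whole set.
placeholders = ",".join("?" * len(ids))
tvs_fast = conn.execute(
    f"SELECT * FROM products WHERE id IN ({placeholders})", ids).fetchall()
```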
I dislike ORMs as much as the next ORM disliker, but people who are more comfortable in whatever the GP programming language is than SQL will write N+1 queries with or without an ORM.
Very true. But ORMs did make it particularly easy to trigger N+1 selects.
It used to be a very common pitfall - and often not at all obvious. You’d grab a collection of objects from the ORM, process them in a loop, and everything looked fine because the objects were already rehydrated in memory.
Then later, someone would access a property on a child object inside that loop. What looked like a simple property access would silently trigger a database query. The kicker was that this could be far removed from any obvious database access, so the person causing the issue often had no idea they were generating dozens (or hundreds) of extra queries.
This problem is associated with ORMs but the moment there's a get_user(id) function which does a select and you need to display a list of users someone will run it in a loop to generate the list and it will look like it's working until the user list gets long.
I really wish there was a way to compose SQL so you can actually write the dumb/obvious thing and it will run a single query. I talked with a dev once who seemed to have the beginnings of a system that could do this. It leveraged async, put composable query-ish objects into a queue, kept track of which callers needed which results, merged and executed the single query, and then returned the results. Obviously far from generalizable for arbitrary queries, but it did seem to work.
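That is roughly the "dataloader" batching pattern. A hypothetical asyncio sketch (fetch_users_by_id is an invented bulk-query helper that runs one WHERE id IN (...) statement):

```python
# Hypothetical sketch of the batching idea: callers await get_user(id) as if
# it were a single lookup, but requests issued in the same event-loop tick are
# merged into one query. fetch_users_by_id() is an invented bulk-query helper.
import asyncio

class UserLoader:
    def __init__(self, fetch_users_by_id):
        self._fetch = fetch_users_by_id   # async: list[int] -> dict[int, user]
        self._pending = {}                # user_id -> Future
        self._flush_scheduled = False

    def get_user(self, user_id):
        loop = asyncio.get_running_loop()
        fut = self._pending.get(user_id)
        if fut is None:
            fut = loop.create_future()
            self._pending[user_id] = fut
        if not self._flush_scheduled:
            self._flush_scheduled = True
            loop.call_soon(lambda: asyncio.ensure_future(self._flush()))
        return fut

    async def _flush(self):
        batch, self._pending = self._pending, {}
        self._flush_scheduled = False
        users = await self._fetch(list(batch))   # single IN (...) query
        for uid, fut in batch.items():
            fut.set_result(users.get(uid))

# Usage: the "dumb" loop still reads naturally but issues one query.
#   users = await asyncio.gather(*(loader.get_user(i) for i in id_list))
```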
I think many ORMs can solve (some of) this these days.
e.g. for ActiveRecord there's ar_lazy_preloader[0] or goldiloader[1] which fix many N+1s by keeping track of a context: you load a set of User in one go, and when you do user.posts it will do a single query for all, and when you then access post.likes it will load all likes for those and so on. Or, if you get the records some other way, you add them to a shared context and then it works.
In defense of the application developer, it is very difficult to adopt the set-theory thinking which helps with SQL when you've never had any real education in this area, and it's tough for almost everyone to switch between it and the loop-oriented processing you're likely using in your application code. ORMs bridge this divide, which is why they fall into the trap consistently. Often it's an acceptable trade-off for the value you get from the abstraction, but then you pay the price when you need to address the leak!