It's interesting to contrast "Measure. Don't tune for speed until you've measured" with Jeff Dean's "Latency Numbers Every Programmer Should Know" [0].
Dean is saying (implicitly) that you can estimate performance, and therefore you can design for speed a priori - without measuring, and, indeed, before there is anything to measure.
I suspect that both authors would agree that there's a happy medium: you absolutely can and should use your knowledge to design for speed, but given an implementation of a reasonable design, you need measurement to "tune" or improve incrementally.
I've had the pleasure of working with some truly fast pieces of code written by experts. It's always both. You have to have a good sense of what's generally fast and what's not in order to design a system that doesn't contain intractable bottlenecks. And once you have a good design you can profile and optimize the remaining constraints.
But e.g. if you want to do fast math, you really need to design your pipeline around cache efficiency from the beginning – it's very hard to retrofit. Whereas reducing memory allocations in order to make parallel algorithms faster is something you can usually do after profiling.
Yeah, the latency numbers provide a ceiling for your algorithm. The actual performance depends on the implementation, code generation, runtime hazards, small dependencies one may have overlooked, etc.
I mean... you should always design with speed in mind (in that Jeff Dean sense :), but what 'premature optimization' refers to is more like localized speed optimizations/hacks. Don't do those until a) you know you'll need them and b) you know where they will help.
(I'm not an expert. I'd love to be corrected by someone who actually knows.)
Floating-point arithmetic is not associative. (A+B)+C does not necessarily equal A+(B+C), but you can get a performance improvement by calculating A, B, and C in parallel, then adding together whichever two finish first. So, in theory, transformers can be deterministic, but in a real system they almost always aren't.
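The non-associativity is easy to demonstrate, e.g. in Python (any IEEE-754 doubles will do; these particular constants are just the classic example):

```python
# IEEE-754 addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

print(left == right)  # False
```

So if the summation order depends on which parallel task finishes first, the low bits of the result can differ from run to run.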
Not an expert either, but my understanding is that large models use quantized weights and tensor inputs for inference. Multiplication and addition of fixed-point values is associative, so unless there's an intermediate "convert to/from IEEE float" step (activation functions, maybe?), you can still build determinism into a performant model.
Fixed-point arithmetic isn't truly associative unless it has infinite precision. The second you hit a limit or saturate/clamp a value, the result very much depends on the order of operations.
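A toy sketch of that order dependence, using a made-up 8-bit saturating add (`sat_add8` is illustrative, not any real library's API):

```python
# Saturating 8-bit add: clamp the sum into [-128, 127].
def sat_add8(x, y):
    return max(-128, min(127, x + y))

a, b, c = 100, 100, -100

# (a + b) saturates to 127 first, so the -100 can't undo the clipping:
print(sat_add8(sat_add8(a, b), c))  # 27
# (b + c) cancels before anything saturates:
print(sat_add8(a, sat_add8(b, c)))  # 100
```

Once any intermediate value clips, the two groupings diverge by far more than a rounding error.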
Ah yes, I forgot about saturating arithmetic. But even for that, you wouldn't need infinite precision for all values, you'd only need "enough" precision for the intermediate values, right? E.g. for an inner product of two N-element vectors containing M-bit integers, an accumulator with at least ceil(log2(N))+2*M bits would guarantee no overflow.
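A quick sanity check of that bound for unsigned M-bit values (the `acc_bits` helper is just for illustration):

```python
import math

def acc_bits(n, m):
    # Accumulator width sufficient for a dot product of n unsigned m-bit values:
    # each product fits in 2m bits, and summing n of them adds ceil(log2(n)) bits.
    return math.ceil(math.log2(n)) + 2 * m

n, m = 1000, 8
worst_case = n * (2**m - 1) ** 2      # every element at its maximum value
print(acc_bits(n, m))                 # 26
print(worst_case < 2 ** acc_bits(n, m))  # True: no overflow at this width
```

(Signed inputs need roughly one more bit for the sign, but the shape of the argument is the same.)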
True, you can increase bit width to guarantee you never hit those issues, but right now saturating arithmetic on types that pretty commonly hit those limits is the standard. Guaranteeing it with current techniques would mean a significant performance drop and/or memory-use increase, to the point that it would significantly affect availability and cost compared to what people expect.
Similarly, you could disallow re-ordering of operations and the like, so the results are guaranteed to be deterministic (even if still "not correct" compared to infinite-precision arithmetic), but that would also carry a big performance cost.
> you can get a performance improvement by calculating A, B, and C in parallel, then adding together whichever two finish first
Technically possible, but I think unlikely to happen in practice.
On the higher level, these large models are sequential and there's nothing to parallelize across steps: inference is a continuous chain of data dependencies between temporary tensors, which makes it impossible to compute different steps in parallel.
On the lower level, each step is a computationally expensive operation on a large tensor/matrix. These tensors often hold millions of numbers, the problem is very parallelizable, and the tactics for doing that efficiently are well researched, because matrix linear algebra has been in wide use for decades. However, it's both complicated and slow to implement fine-grained parallelism like "adding together whichever two finish first" on modern GPUs: with many thousands of active threads, that much synchronization is too expensive. Instead, operations like matrix multiplication often assign one thread per output element (or per fixed count of output elements), and reductions like softmax or vector dot products use a series of exponentially decreasing reduction steps, i.e. the order is deterministic.
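A minimal sketch of that kind of fixed-order tree reduction, in plain Python rather than a GPU kernel, just to show the deterministic pairing (each pass combines element i with element i + stride, and the stride doubles):

```python
# Tree reduction with a fixed pairing order: element i absorbs element
# i + stride, then the stride doubles, so the summation tree is identical
# on every run regardless of thread timing.
def tree_sum(xs):
    xs = list(xs)
    n = len(xs)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            xs[i] += xs[i + stride]
        stride *= 2
    return xs[0]

print(tree_sum([1.0, 2.0, 3.0, 4.0, 5.0]))  # 15.0, same grouping every time
```

On a GPU the inner loop runs in parallel across threads, but the pairing, and therefore the floating-point grouping, is still fixed by index, not by which thread finishes first.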
However, that order may change with even a minor update to any part of the software, including opaque low-level pieces like GPU drivers and firmware. Library developers update GPU kernels and drivers, while firmware and OS kernels collectively implement the scheduler that assigns work to cores; both can affect the order of these arithmetic operations.
I don't think the order of operations is non-deterministic between different runs. That would make programming and researching these systems more difficult than necessary.
There are two extremes here: first, the "architects" that this article rails against. Yes, it's frustrating when a highly-paid non-expert swoops in to offer unhelpful or impossible advice.
On the other hand, there are Real Programmers [0] who will happily optimize the already-fast initializer, balk at changing business logic, and write code that, while optimal in some senses, is unnecessarily difficult for a newcomer (even an expert engineer) to understand. These systems have plenty of detail and are difficult to change, but the complexity is non-essential. This is not good engineering.
It's important to resist both extremes. Decision makers ultimately need both intimate knowledge of the details and the broader knowledge to put those details in context.
Another point is that the world is always changing. If you work slowly, you are at much greater risk of having an end result that isn't useful anymore.
(Like the author, of course, I'm massively hypocritical in this regard).
I think that there are three relevant artifacts: the code, the specification, and the proof.
I agree with the author that if you have the code (and, with an LLM, you do) and a specification, AI agents could be helpful to generate the proof. This is a huge win!
But it certainly doesn't confront the important problem of writing a spec that captures the properties you actually care about. If the LLM writes that for you, I don't see a reason to trust that any more than you trust anything else it writes.
> Couples often flake together. This changes the probability distribution of attendees considerably
It's interesting to consider the full correlation matrix! Groups of friends may tend to flake together too, people who live in the same neighborhood might rely on the same subways or highways...
I think this is precisely the same problem as pricing a CDO, so a Gaussian Copula or graphical model is really what you need. To plan a great party.
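You don't even need the full copula machinery to see the effect. A minimal Monte Carlo sketch (guest counts and show-up probability below are made up for illustration): let each couple flip one shared coin, since they flake together, while singles flip independently. The expected headcount matches the independent case, but the swings are wider.

```python
import random

# Correlated attendance: couples share one flake decision, singles decide alone.
def simulate(n_couples=5, n_singles=6, p_show=0.7, trials=100_000, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        attendees = sum(2 for _ in range(n_couples) if rng.random() < p_show)
        attendees += sum(1 for _ in range(n_singles) if rng.random() < p_show)
        total += attendees
    return total / trials

print(simulate())  # mean headcount is about (2*5 + 6) * 0.7 = 11.2
```

The mean is the same as for 16 independent guests, but because couples arrive or flake in pairs, the variance is higher, so plan seats and food for wider swings than an independent-guest model suggests.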
We tend to calculate "people at percentages", ie: 2 adults, 2 kids, 50% chance of showing up rates as an attendance-load of 1.5 virtual people (for food calculations).
Then sometimes you need the "max + min souls" (seats, plates), and account for what we call "the S-factor" if someone brings an unexpected guest, roommate, etc.
Lastly: there is a difference between a "party" and a "soirée" (per my college roommate: "you don't have parties, you have soirées!")
All the advice is really accurate; it makes me miss hosting. If you want to go a little deeper, there's a book called "How to Be a Gentleman" with a useful section on "A Gentleman Hosts a Party", and "Dads Own Cookbook" has a chapter on party planning, hosting, and preparation timelines... there's quite a bit of art and science to it!
> We tend to calculate "people at percentages", ie: 2 adults, 2 kids, 50% chance of showing up rates as an attendance-load of 1.5 virtual people (for food calculations).
>
> Then sometimes you need the "max + min souls" (seats, plates), and account for what we call "the S-factor" if someone brings an unexpected guest, roommate, etc.
I made myself a "food and drinks amount" calculator for weekend/week-long party events a few years back, and it was eerily accurate once you factor unexpected plus-ones, flake rates, hangovers, and other computable-at-scale events into the formula!
I've never had the mental bandwidth to try to manage my manager and team like this. While I don't trust them to provide the best feedback, I also don't trust that I won't make mistakes. And what does it matter if I can't control everything, unless too much risk is involved?
The color of that bike shed is distracting, though. Is it purple or pink?
0: https://gist.github.com/jboner/2841832