sluongng · 2026-05-18T20:12:15 1779135135

I saw plenty of cool advancements in reducing inference cold starts when I met folks in person at FOSDEM this year. However, I still struggle to understand: why would folks care about this?

The major AI labs have all secured their own compute in the form of hardware, data centers, and power generation. That means their resource pool is fixed, and they can do all sorts of tricks (pre-loading, pre-allocating, etc.) to improve inference latency.

Cold-start optimization is usually a solution for "cloud" environments, where your pool is flexible and you only pay for what you use. It is less effective in bare-metal settings, as folks there do not care as much about scaling up and down.

So my question is: who is this for? AWS and GCP running Anthropic models?

artisin · 2026-05-18T22:19:55 1779142795

At least folks like me care about it. My local hardware is more than enough to handle my app, but given Spectrum's internet service is as fickle as a broken fiddle I'm forced to rent a dedicated cloud gpu that sits idle most days. However, I would save a serious chunk of change if I could boot up a GPU snapshot in ~10s. I evaluated various options a while back and, while modal.com was the fastest, it still took around a minute-ish. Granted, my use case is unique, but I imagine this could be a decent solution for gpu-poor ComfyUI users.

binsquare · 2026-05-19T00:40:39 1779151239

I work in a slightly different domain and I focus a lot on optimizing coldstarts.

Here's my 2 cents: improving cold starts also means utilizing resources more effectively.

From cloud providers to end users - every ms both adds up and translates to additional waste of electricity/hardware and costs.

sluongng · 2026-04-02T03:35:18 1775100918

You can run bisect with first-parent

hauntsaninja · 2026-04-02T03:59:17 1775102357

That sounds right. `git_bayesect` currently uses `--first-parent`, so I think belden's use case should work, but I haven't tested it much on complicated git histories.
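To illustrate why first-parent helps here, below is a minimal sketch of bisecting over a first-parent chain. It is plain binary search on a toy history, not `git_bayesect`'s Bayesian method and not git's own implementation; the `PARENTS` graph and the function names are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical toy history: commit -> parents, with the first parent listed first.
PARENTS: Dict[str, List[str]] = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "f1": ["B"],       # feature-branch commit
    "f2": ["f1"],      # feature-branch commit
    "M": ["C", "f2"],  # merge commit: first parent is mainline commit C
    "D": ["M"],
}


def first_parent_chain(tip: str) -> List[str]:
    """Return commits oldest to newest, following only first parents."""
    chain = [tip]
    while PARENTS[chain[-1]]:
        chain.append(PARENTS[chain[-1]][0])
    return chain[::-1]


def bisect_first_bad(chain: List[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the first bad commit.

    Assumes chain[0] is good, chain[-1] is bad, and that every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(chain) - 1  # invariant: chain[lo] is good, chain[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(chain[mid]):
            hi = mid
        else:
            lo = mid
    return chain[hi]


mainline = first_parent_chain("D")
# Pretend the bug was introduced on the feature branch in f1.
culprit = bisect_first_bad(mainline, lambda c: c in {"f1", "f2", "M", "D"})
```

The search never visits `f1` or `f2`, so it blames the merge `M` that brought the bug onto the mainline rather than a commit inside the merged branch.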

sluongng · 2026-03-30T13:27:48 1774877268

My teammate had a great time reimplementing Ninja (slop-free) in Go here https://github.com/buildbuddy-io/reninja to make it even faster with Remote Build Execution.

setheron · 2026-03-30T13:54:39 1774878879

This is cool. Going to see if I can use it at work.

sluongng · 2026-03-02T18:09:24 1772474964

Yeah the 8 agents limit aligns well with my conversations with folks in the leading labs

https://open.substack.com/pub/sluongng/p/stages-of-coding-ag...

I think we need very different tooling to go beyond a ratio of 1 human to 10 agents. And much, much different tooling to achieve a higher ratio than that.

Scea91 · 2026-03-02T21:30:44 1772487044

I don't think number of parallel agents is the right productivity metric, or at least you need to account for agent efficiency.

Imagine a superhuman agent that does not need to run in endless loops. It could generate a 100k-line codebase in a few minutes or solve smaller features in seconds.

In a way, the inefficiency is what leads people to parallelism. There is only room for it because the agents are slow; perhaps the more inefficient and slower the individual agents are, the more parallel we can be.

sluongng · 2026-03-03T08:32:28 1772526748

Yeah, I don't disagree with your assessment at all. I think the H2A ratio is still a good metric for the AI adoption rate of an organization. At a higher H2A ratio, you will also start to hear people measuring things using token volumes, which I think is a similar metric (because most models nowadays run at a relatively fixed tokens/second speed).

None of this is a direct signal of a productivity boost. I think at higher volumes, you will need to start accounting for the "yield" rate of the token volumes above: what share of the tokens makes it to the final production deployment? Which stage constrains the yield? Is it the models, the harness, or something else (e.g. Code Review, CI/CD, Security Scans, etc.)? And then it becomes an optimization problem: reduce the Cost of Goods Sold while improving or maintaining Revenues. "Productivity" then dissolves into multiple separate but more tangible metrics.
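A rough sketch of what measuring that yield could look like, treating the pipeline as a funnel. The stage names and token counts are made up purely for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

Funnel = List[Tuple[str, int]]  # ordered (stage name, tokens surviving that stage)

# Made-up numbers, for illustration only.
FUNNEL: Funnel = [
    ("generated", 1_000_000),
    ("accepted by harness", 400_000),
    ("passed code review", 200_000),
    ("passed CI and security scans", 180_000),
    ("deployed", 150_000),
]


def stage_pass_rates(funnel: Funnel) -> List[Tuple[str, float]]:
    """Fraction of tokens that survive each stage, relative to the previous stage."""
    return [
        (stage, after / before)
        for (_, before), (stage, after) in zip(funnel, funnel[1:])
    ]


def overall_yield(funnel: Funnel) -> float:
    """Fraction of generated tokens that reach the final stage."""
    return funnel[-1][1] / funnel[0][1]


def bottleneck(funnel: Funnel) -> str:
    """Stage with the lowest pass rate, i.e. the tightest constraint on yield."""
    return min(stage_pass_rates(funnel), key=lambda r: r[1])[0]


for stage, rate in stage_pass_rates(FUNNEL):
    print(f"{stage}: {rate:.0%}")
print(f"overall yield: {overall_yield(FUNNEL):.0%}, bottleneck: {bottleneck(FUNNEL)}")
```

With these invented numbers the harness stage would be the constraint; in a real organization the interesting part is finding out which stage it actually is.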

schipperai · 2026-03-02T18:39:55 1772476795

A few experiments like Gas Town, the compiler from Anthropic, or the browser from Cursor managed to reach the Rocket stage, though in their reports the jagged intelligence of the LLMs was eerily apparent. Do you think we also need better models?

sluongng · 2026-03-02T19:43:31 1772480611

I do. The reason the current generation of agents is good at coding is that the labs have had sufficient time and compute to generate synthetic chain-of-thought data and feed that data through RL before using it to train the LLMs. This distillation takes time, and the clock starts at the release of the previous generation of models.

So we are just now getting agents that can reliably loop themselves for medium-sized tasks. This generation opens a new door towards agent-managing-agents chain-of-thought data. I think we will only get highly reliable multi-agent systems sometime between the middle and the end of 2026, assuming no major geopolitical disruption.

sluongng · 2026-03-01T08:09:02 1772352542

Most of the time, the CI resources in a Python monorepo are not spent on packaging. They're spent on running the tests.

I would love to read more about how the author is tackling the testing problem in their setup.

danielgafni · 2026-03-01T14:26:05 1772375165

Hey, I’m the author.

At the bare minimum, tests of unchanged code and dependencies would be skipped (cached) as well in this setup.

More sophisticated rules would have to be set up by hand, but again it's easy to do with Dagger (as you can express any kind of logic there).

But the whole point of using Dagger in this setup is to get tests caching out of the box, and for that you need to assemble the container correctly (by only including relevant dependency files).
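The general idea of keying a test cache on only the relevant files can be sketched like this. This is not Dagger's API or a description of its actual caching mechanism; `cache_key`, the file names, and the in-memory `files` mapping are all hypothetical.

```python
import hashlib
from typing import Dict, Iterable


def cache_key(files: Dict[str, bytes], relevant: Iterable[str]) -> str:
    """Hash only the files a test run depends on.

    `files` maps path -> content; `relevant` lists the paths the tests
    actually need. Files outside `relevant` never affect the key.
    """
    h = hashlib.sha256()
    for path in sorted(relevant):  # sort so the key does not depend on input order
        h.update(path.encode())
        h.update(b"\0")
        h.update(hashlib.sha256(files[path]).digest())
    return h.hexdigest()


# Hypothetical monorepo snapshot.
repo = {
    "pkg_a/pyproject.toml": b"[project]\nname = 'a'\n",
    "pkg_a/src/a.py": b"def f():\n    return 1\n",
    "pkg_b/src/b.py": b"def g():\n    return 2\n",
}
pkg_a_inputs = ["pkg_a/pyproject.toml", "pkg_a/src/a.py"]

key_before = cache_key(repo, pkg_a_inputs)
repo["pkg_b/src/b.py"] = b"def g():\n    return 3\n"  # unrelated change
key_after_unrelated = cache_key(repo, pkg_a_inputs)   # same key: cached result reusable
repo["pkg_a/src/a.py"] = b"def f():\n    return 42\n"  # relevant change
key_after_relevant = cache_key(repo, pkg_a_inputs)     # new key: tests must rerun
```

If the container (or here, the `relevant` list) pulls in files the tests do not need, every unrelated change invalidates the key, which is why assembling it narrowly matters.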

sluongng · 2026-02-27T08:06:19 1772179579

https://sluongng.substack.com/i/186718212/test-is-king I wrote about this less than a month ago. Things are moving pretty fast in this direction.

sluongng · 2026-02-25T19:01:02 1772046062

Oh this is really neat for the Bazel community, as depending on tree-sitter to build a gazelle language extension, with Gazelle written in Go, requires you to use CGO.

Now perhaps we can get rid of the CGO dependency and make it pure Go instead. I have pinged some folks to take a look at it.

dilyevsky · 2026-02-25T22:06:10 1772057170

would also be nice to have this support gopackagesdriver backend

odvcencio · 2026-02-25T19:12:57 1772046777

thanks so much for the note! i really appreciate it. i built this precisely for folks like yourself with this specific pain, thanks again!

sluongng · 2026-01-29T09:02:16 1769677336

Not yet in Linux?

sluongng · 2026-01-28T12:49:12 1769604552

I suspect they just use no_std whenever it's applicable

https://github.com/facebook/buck2/commit/4a1ccdd36e0de0b69ee...

https://github.com/facebook/buck2/commit/bee72b29bc9b67b59ba...

Turns out that if you have strong control over the compiler and linker instrumentation, there are a lot of ways to optimize binary size.

sluongng · 2026-01-27T12:24:33 1769516673

Zstd dictionary compression is essentially how Meta's Mercurial fork (Sapling VCS) stores blobs https://sapling-scm.com/docs/dev/internals/zstdelta. The source code is available on GitHub if folks want to study the tradeoffs vs git delta-compressed packfiles.

I think theoretically, Git delta-compression is still a lot more optimized for smaller repos. But for bigger repos where sharded storage is required, path-based delta dictionary compression does much better. Git recently (in the last year) got something called "path-walk" which is fairly similar though.
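For anyone who wants to poke at the idea, here is a small sketch of dictionary-based delta compression: compress a new version of a file using the previous version as the dictionary. To stay within the Python standard library it uses zlib's preset dictionary (`zdict`) as a stand-in for zstd, so it illustrates the concept only and is not Sapling's implementation. The helper names and sample data are made up.

```python
import zlib


def compress_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Compress `data`, letting the compressor reference bytes in `dictionary`."""
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Inverse of compress_with_dict; needs the exact same dictionary."""
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()


# Made-up "previous version" of a file at some path.
base = b"".join(f"line {i}: value={i * i}\n".encode() for i in range(200))
# "New version": the same content with one line changed.
new = base.replace(b"line 100: value=10000\n", b"line 100: value=changed\n")

plain = zlib.compress(new, 9)          # standalone compression
delta = compress_with_dict(new, base)  # previous version used as the dictionary

print(len(new), len(plain), len(delta))
restored = decompress_with_dict(delta, base)
```

The blob compressed against the previous version should come out much smaller than the standalone one, at the cost of needing that previous version available to decompress.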