There are plenty of cool advancements in reducing inference cold start when I was meeting with folks in person at FOSDEM this year. However, I still struggle to understand: why would folks care about this?
Major AI Labs all have secured their own compute in the form of hardware, data center, and power generation. That means their resource pool is fixed, and they can do all sorts of tricks to pre-load, pre-allocate, etc... to improve on inference latency.
Cold start is usually a solution for "cloud" environment when your pool is flexible, and you only pay for what you use. Its effectiveness lowered in bare-metal settings as folks do not care about scaling up and down as much.
So my question is: who is this for? AWS and GCP running Anthropic models?
At least folks like me care about it. My local hardware is more than enough to handle my app, but given Spectrum's internet service is as fickle as a broken fiddle I'm forced to rent a dedicated cloud gpu that sits idle most days. However, I would save a serious chunk of change if I could boot up a GPU snapshot in ~10s. I evaluated various options a while back and, while modal.com was the fastest, it still took around a minute-ish. Granted, my use case is unique, but I imagine this could be a decent solution for gpu-poor ComfyUI users.
That sounds right. `git_bayesect` currently uses `--first-parent`, so I think belden's use case should work, but I haven't tested it much on complicated git histories.
My teammate has a great time reimplementing Ninja (slop-free) in Go here https://github.com/buildbuddy-io/reninja to make it even faster with Remote Build Execution.
I don't think number of parallel agents is the right productivity metric, or at least you need to account for agent efficiency.
Imagine a superhuman agent who does not need to run in endless loops. It could generate 100k line code-base in a few minutes or solve smaller features in seconds.
In a way, the inefficiency is what leads people to parallelism. There is only room for it because the agents are slow, perhaps the more inefficient and slower the individual agents are, the more parallel we can be.
Yeah, I don't disagree with your assessment at all. I think the H2A ratio is still a good metric for the AI adoption rate of an organization. At a higher H2A ratio, you will also start to hear people measuring things using token volumes, which I think is also a similar metric (because most models nowadays run on a relatively fixed Tokens/second speed).
All of this is not a direct signal to a productivity boost. I think at higher volumes, you will need to start to account for the "yield" rate of the token volumes above: what are the volumes of tokens that get to the final production deployment? At which stage is it a constraint on the yield? Is it the models, or is it the harness, or something else (i.e. Code Review, CI/CD, Security Scans etc...)? And then it becomes an optimization problem to reduce the Cost of Goods Sold while improving/maintaining Revenues. The "productivity" will then be dissolved into multiple separate but more tangible metrics.
Few experiments like gas town, the compiler from Anthropic or the browser from Cursor managed to reach the Rocket stage, though in their reports the jagged intelligence of the LLMs was eerily apparent. Do you think we also need better models?
I do. The reason why the current generation of agents are good at coding is because the labs have sufficient time and computes to generate synthetic chain-of-thoughts data, feed those data through RL before use them to train the LLMs. These distillation takes time, time which starts from the release of the previous generation of models.
So we are just now getting agents which can reliably loop themselves for medium size tasks. This generation opens a new door towards agent-managing-agents chain of thoughts data. I think we would only get multi-agents with high reliability sometimes by the mid to end of 2026, assuming no major geopolitical disruption.
At the bare minimum tests of unchanged code and dependencies would be skipped (cached) as well in this setup.
More sophisticated rules would have to be set up by hand, but again it’s easy to do with Dagger (as you can expected any kind of logic there).
But the whole point of using Dagger in this setup is to get tests caching out of the box, and for that you need to assemble the container correctly (by only including relevant dependency files).
Oh this is really neat for the Bazel community, as depending on tree-sitter to build a gazelle language extension, with Gazelle written in Go, requires you to use CGO.
Now perhaps we can get rid of the CGO dependency and make it pure Go instead.
I have pinged some folks to take a look at it.
Zstd dictionary compression is essentially how Meta's Mercurial fork (Sapling VCS) stores blobs https://sapling-scm.com/docs/dev/internals/zstdelta. The source code is available in GitHub if folks want to study the tradeoffs vs git delta-compressed packfiles.
I think theoratically, Git delta-compression is still a lot more optimized for smaller repos. But for bigger repos where sharding storaged is required, path-based delta dictionary compression does much better. Git recently (in the last 1 year) got something called "path-walk" which is fairly similar though.
Major AI Labs all have secured their own compute in the form of hardware, data center, and power generation. That means their resource pool is fixed, and they can do all sorts of tricks to pre-load, pre-allocate, etc... to improve on inference latency.
Cold start is usually a solution for "cloud" environment when your pool is flexible, and you only pay for what you use. Its effectiveness lowered in bare-metal settings as folks do not care about scaling up and down as much.
So my question is: who is this for? AWS and GCP running Anthropic models?