Hacker Newsnew | past | comments | ask | show | jobs | submit | rs545837's commentslogin

Damn this was really fun to use.


I agree 100% with this thinking approach, I've been working in this domain for quite a few months now.

The right granularity for agents isn't files or lines, it's entities: functions, classes, methods. That's how both humans and agents actually think about code.

We built sem(Ataraxy-Labs/sem) which extracts entities from 30+ languages via tree-sitter and builds a cross-file dependency graph, so building semantic version control and semantic diff. weave (same org) takes it further and does git merges at entity level. Matches functions by name, merges their bodies independently.

The dependency graph also answers questions LLMs can't. I love the analysis based on ASTs.


Oh this is cool. The Yjs as storage backend trick is clever, you basically get CRDT sync for free without having to build your own replication layer. And the pluggable storage means you can develop against in-memory and then flip to YGraph for collab mode without touching your queries. That's a nice developer experience.

The live queries also caught my eye. Having traversals auto reexecute when data changes sounds straightforward until you realize the underlying data is being merged from multiple peers concurrently. Getting that right without stale reads or phantom edges is genuinely hard.

I've been researching on something like this in a similar space but for source code, therefore built a tool called Weave(https://github.com/Ataraxy-Labs/weave) for entity level merges for git. Instead of merging lines of text, it extracts functions, classes, and methods, builds a dependency graph between them, and merges at that level.

Seeing codemix makes me think there might be something interesting here. Right now our entity graph and our CRDT state are two separate things. The graph lives our analysis engine and the CRDT lives in different crate. If something like @codemix/graph could unify those, you'd have a single data structure where the entity dependency graph is the CRDT.


Semantic merge. PlasticSCM had that a feature many years back


yeah definitely plastic scm has always been my inspiration, just trying to revive it.


You could agree that the PR is the meaningful unit for shipping, but push back gently that for agents working in parallel, the commit/changeset level matters more than it used to because agents don't coordinate the way humans do. Multiple agents touching the same repo need finer-grained units of change than "the whole PR."


Could you elaborate a bit more on this? Curious what your workflow looks like. Is this multiple agents running on the same feature/refactor/whatever unit of work? For concurrent but divergent work I just use a git worktree per feature. And I think I only ever have a single agent (with whatever subagents it spins up) per unit of work.


Think two agents working on the same codebase at the same time. Agent A is refactoring the auth module, Agent B is adding a new API endpoint that imports from auth. Separate worktrees, separate branches, but they're touching overlapping code.

ingle agent per feature works great today. But as agents get faster and cheaper, the bottleneck shifts to, how many agents can work on one repo simultaneously without stepping on each other.


jj is genuinely great and I think it deserves way more adoption than it has right now. The mental model is so much cleaner than git, undo actually works the way you'd expect it to, and working with stacked changes feels natural instead of that constant low-grade anxiety of actually breaking something. It's probably the best frontend for version control that exists today.

For the last few months though I've been thinking a lot about what you said at the end there. What if version control actually understood the code it was tracking, not as lines of text but as the actual structures we write and think in, functions, classes, methods, the real building blocks? A rename happening on one branch and an unrelated function addition on another aren't a real conflict in any meaningful sense, they only look like one because every tool we have today treats source code as flat text files.

For enhancing this kind of structural intelligence I started working on https://github.com/ataraxy-labs/sem, which uses tree-sitter to parse code into semantic entities and operates at that level instead of lines. When you start thinking of code not as text there's another dimension where things can go, even a lot of logic at the comiler level with call graphs becomes useful.


This is awesome honestly, Stacked PRs are one of those features that feels obvious in hindsight. Breaking a n-line PR into 3 focused layers where each one is independently reviewable is a huge win for both the author and reviewer. The native GitHub UI with the stack navigator is the right call too, and there's no reason this should require a third-party tool.

One thing I keep thinking about in this same direction: even within a single layer of a stack, line-level diffs are still noisy. You rename a function and update x call sites, the diff shows y changed lines. A reviewer has to mentally reconstruct "oh this is just a rename" from raw red/green text.

Semantic diffing (showing which functions, classes, methods were added/modified/deleted/moved) would pair really well with stacks. Each layer of the stack becomes even easier to review when the diff tells you "modified function X, added function Y" instead of just showing changed lines.

I've been researching something in this direction, https://ataraxy-labs.github.io/sem/. It does entity-level diffs, blame, and impact analysis. Would love to see forges like GitHub move in this direction natively. Stacked PRs solve the too much at once problem. Semantic diffs solve the "what actually changed" problem. Together they'd make code review dramatically better.


One cheap optimization for the compile overhead case: skip commits that only touch files unrelated to the failing test. If you know the test's dependency chain, any commit that doesn't touch that chain gets prior weight zero. Equivalent to git bisect skip but automatic. Cuts the search space before you compile anything.


This is a real pain point. One thing that helps: when an LLM agent makes changes across multiple commits, look at what it actually touched structurally. Often the agent adds a feature in commit 5 but subtly breaks something in commit 3 by changing a shared function it didn't fully understand.


Really fun work, and the writeup on the math is great. The Beta-Bernoulli conjugacy trick making the marginal likelihood closed-form is elegant.

We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%. The entropy-minimization selection is key here since naive median splitting converges much slower.

One thing we found, you can squeeze out another 10-15% accuracy by weighting the prior with code structure. Commits that change highly-connected functions (many transitive dependents in the call graph) are more likely culprits than commits touching isolated code. That prior is free, zero test runs needed.

Information-theoretically, the structural prior gives you I_prior bits before running any test, reducing the total tests needed from log2(n)/D_KL to (log2(n) - I_prior)/D_KL. On 1024-commit repos with 80/20 flakiness: 92% accuracy with graph priors vs 85% pure bayesect vs 10% git bisect.

We're building this into sem (https://github.com/ataraxy-labs/sem), which has an entity dependency graph that provides the structural signal.


> We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%.

I don't understand what you're comparing. Can't you increase bayesect accuracy arbitrarily by running it longer? When are you choosing to terminate? Perhaps I don't understand this after all.


Yes, bayesect accuracy increases with more iterations. The comparison was at a fixed budget(300 test runs) when I was running. Sorry should have clarified more on that.


Yep, you can run bayesect to an arbitrary confidence level.

This script in the repo https://github.com/hauntsaninja/git_bayesect/blob/main/scrip... will show you that a) the confidence level is calibrated, b) how quickly you get to that confidence level (on average, p50 and p95)

For the failure rates you describe, calibration.py shows that you should see much higher accuracy at 300 tests


You're right, at 300 tests bayesect converges to ~97-100% across the board. I reran with calibration.py and confirmed.

Went a step further and tested graph-weighted priors (per-commit weight proportional to transitive dependents, Pareto-distributed). The prior helps in the budget-constrained regime:

128 commits, 500 trials:

Budget=50, 70/30: uniform 22% → graph 33% Budget=50, 80/20: uniform 71% → graph 77% Budget=100, 70/30: uniform 56% → graph 65% At 300 tests the gap disappears since there's enough data to converge anyway. The prior is worth a few bits, which matters when bits are scarce.

Script: https://gist.github.com/rs545837/b3266ecf22e12726f0d55c56466...


usually the whole discussion has been around line-level vs commit-level history, but there's a layer nobody's talking about, and I have been exploring it here these days with https://github.com/Ataraxy-Labs/sem. It gives you entity-level version control. It parses your code into functions, classes, methods using tree-sitter (12 languages so far), computes a structural hash for each entity, and builds a cross-file dependency graph. So sem diff HEAD~1 doesn't give you "+3 -2 in tax.py", it gives you "calculate_tax signature changed, 47 dependents, 3 callers will break". The key insight is distinguishing signature changes from body changes.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: