
Diamond types author here. I’m working on another blog post now - I recently had a big breakthrough that makes collaborative text editing many times faster again, using some new ideas, and I’m in the process of writing it up.

And there’s another hard problem I’m stuck on: CRDTs like mine and Yjs use (random agent id, sequence number) pairs as unique IDs for operations. But a troublesome user might reuse an operation ID, sending different operations to different peers with the same (agent, seq) pair. This results in really awful, weird bugs. Martin Kleppmann wrote a paper about BFT in CRDTs. His suggestion is that we do what git does and use SHA hashes as IDs instead.
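To make the failure mode concrete, here's a toy sketch of the ID-reuse attack (hypothetical dictionaries, not any real CRDT's wire format):

```python
# A malicious agent attaches the SAME (agent, seq) ID to two DIFFERENT
# operations and sends one to each peer.
op_to_peer1 = {"id": ("mallory", 7), "insert": "x", "pos": 0}
op_to_peer2 = {"id": ("mallory", 7), "insert": "y", "pos": 3}

# Peers deduplicate operations by ID, so each keeps whichever version it
# saw first - and the replicas silently diverge even after exchanging
# their full operation sets.
assert op_to_peer1["id"] == op_to_peer2["id"]
assert op_to_peer1 != op_to_peer2
```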

But there are two problems:

1. We don’t want to store a hash for every keystroke a user makes. That would get big, fast. We want a DAG of hashes like git, except that a new hash is generated with each keystroke. How do we build an efficient system that can query by hash without storing all of the hashes?

2. I want users to be able to permanently delete inserted text. If I accidentally paste an AWS key into a document, I don’t want that key in the document history forever. How do I enable that while still getting all the benefits of the hashing system above? (Note that if your hash includes the typed character, then even if you delete the character itself from disk, you can trivially brute force the hash to recover any typed character.)

Each of these problems in isolation has a solution. But the real prize is how do you solve both at the same time? I think this problem has a solution but I don’t think anyone has solved it yet. Want to name an algorithm and get my eternal respect? Here’s your chance.



If my understanding of byzantine fault tolerance with CRDTs and collaborative editing is correct, it's really only a problem in systems with an untrusted user - someone who is going to actively try to corrupt the system?

It seems to me that for the majority of use cases, collaborative editing in a work sense, you would normally have trust that a user isn't actively hostile. And so I'm not convinced that BFT needs to be a high priority goal in most of these implementations.

I think a higher-priority issue is how to aid developers using generic CRDT structures within a system where not all the rules are encoded into the CRDTs themselves. For example, when using Yjs with Prosemirror: Prosemirror has its own schema that defines how a document can be structured, but not all of those rules are encoded into the underlying Yjs structure, so it's possible for two conflicting edits to result in a Yjs document that doesn't comply with the Prosemirror schema. The current system throws away any non-conforming data when the document is loaded or updated. Because of this, you need to load the document into a server-side instance of Prosemirror in order to apply the schema when doing a server-side merge, or even if you are just rendering the static content after "blindly" merging the documents.

My point really is that CRDTs are an encoding of user intent, and when these generic CRDTs are merged you have a structure representing merged user intent, not a final conforming state. The solution is either domain-specific CRDTs tailored exactly to your application, or a schema system for these generic types that applies rules when merging. Prosemirror/Yjs does (mostly) exactly this, but I think it needs to be solved in a more generic way that can run at any layer of the system.


Yes - the BFT problem only matters when you have Byzantine actors. But I think users deserve and expect the system to be reasonably well behaved and predictable in all situations. Anything publicly writable, for example, needs BFT resilience. Or any video game.

As for the prosemirror problem, I assume you’re talking about weird merges from users putting markdown in a text crdt? You’re totally right - this is a problem. Text CRDTs treat documents as a simple sequence of characters. And that confuses a lot of structured formats. For example, if two users concurrently bold the same word, the system should see that users agree that it should be bolded. But if that “bold” intent is translated into “insert double asterisks here and here”, you end up with 4 asterisks before and after the text, and that will confuse markdown parsers. The problem is that a text crdt doesn’t understand markdown. It only understands inserting items, and bolding something shouldn’t be treated as an insert.

JSON editing has similar problems. I’ve heard of plenty of people over the years putting json text into a text crdt, only to find that when concurrent edits happen, the json grows parse errors. Eg if two users concurrently insert “a” and “b” into an empty list. The result is [“a””b”] which can’t be parsed.
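A minimal sketch of that failure, simulating a character-level merge by hand rather than using a real CRDT library:

```python
import json

# Toy illustration: both users edit the JSON *text* '[]' concurrently,
# each inserting an element between the brackets.
doc = "[]"
edit_a = '"a"'   # user A inserts at index 1
edit_b = '"b"'   # user B concurrently inserts at index 1

# A character-sequence merge has no concept of JSON structure, so one
# plausible convergent result is simply both insertions side by side:
merged = doc[:1] + edit_a + edit_b + doc[1:]
print(merged)  # ["a""b"]

try:
    json.loads(merged)
    parsed = True
except json.JSONDecodeError:
    parsed = False
print(parsed)  # False: the merged text is no longer valid JSON
```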

The answer to both of these problems is to use CRDTs which understand the shape of your data structure. Eg, use a json OT/crdt system for json data (like sharedb or automerge). Likewise, if the user is editing rich text in prosemirror then you want a rich text crdt like peritext. Rich text CRDTs add the concept of annotations - so if two users bold overlapping regions of text, the crdt understands that the result should be that the entire region is bolded. And that can be translated back to markdown if you want.

Luckily, in over a decade of work in the collaborative editing space, I haven’t found any examples of this problem other than rich text and json. I think if a collaborative editing system supports json data, rich text and plain text, then it’ll be enough to add collaborative editing support to 99% of applications.

The ink & switch people did a great write up of how rich text CRDTs work here: https://www.inkandswitch.com/peritext/


> Or any video game

Agreed, BFT is clearly needed for multiplayer CRDT backed video games.

> I assume you’re talking about weird merges from users putting markdown in a text crdt?

Nope, although that is an issue. In that case the document shouldn't be markdown, it should be a rich text CRDT that's converted to markdown as output.

On the conflicts I mentioned, here's an example. Say you are building a to-do list app. First let's do it with Prosemirror and Yjs, but for some reason we have decided to limit the length of a to-do list to 10 items. Prosemirror lets you do that when defining a schema: set a maximum number of child nodes for the parent node type. With the current Yjs/Prosemirror system, if you have 9 items in the list and two people concurrently add a 10th, one of them will be dropped by Prosemirror (deterministically). The document schema enforced that rule outside of the CRDT implementation; Yjs xmlFragments have no concept of these sorts of rules.

Now say you want to do this with the json-like Map and Array types. Again, the Array type has no concept of a length limit: it will merge the two concurrent edits and create a document with 11 entries. In this case your application has to detect the no-longer-conforming document and correct it.
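A toy sketch of that schema gap (illustrative names, not a real Yjs API): a grow-only set CRDT merged by union, with an out-of-band rule the CRDT knows nothing about.

```python
# The schema rule "at most 10 items" lives outside the CRDT.
MAX_ITEMS = 10
base = {f"item{i}" for i in range(9)}      # 9 existing to-do items

replica_a = base | {"item from user A"}    # user A adds a 10th item
replica_b = base | {"item from user B"}    # user B concurrently adds a 10th

merged = replica_a | replica_b             # CRDT merge: union of insertions
valid = len(merged) <= MAX_ITEMS

print(len(merged), valid)                  # 11 False: the app must repair it
```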

The issue comes if you naively merge the documents on the server and dump the json: the merge will not take into account your application's own conflict resolution. My suggestion is that a CRDT schema could do this. It would be a bit like JSON Schema, but with rules about how to correct misshapen structures.

So yes, I agree these generic rich text plus JSON types cover what 99% of applications need, but they also need to enforce a shape on the data structure that isn't built into the generic types. Having a way to do that as part of the merging layer, rather than the application layer, would help to ensure correctness.


Yeah I hear you. I've still never found a good way to implement schema validation rules like "this list cannot have more than 10 items" on top of a CRDT. Eventually consistent collaborative editing systems (like CRDTs and OT based systems) are optimistic. They allow any edits, and then later merge things if necessary. But if you only find out the list has 11 elements at the merging step, what do you do? It's too late to reject either of the inserts which put you in that state.

My best answer at the moment is to tell people to rethink validation rules like that. I can't think of a lot of good use cases for collaborative editing where a "length <= 10" rule is something you want.

Unfortunately, validation rules are really important for referential integrity. If you add a reference to some item in a data set and I concurrently delete that item, what should happen? Does the delete get undone? Does the reference get removed? Is the reference just invalid now? Should references only be to an item at some point in time (e.g. like a git SHA rather than a path)? Maybe optimistic systems just can't have referential integrity? It's an uncomfortable problem.


Hi Joseph,

This is a hard problem and something that I think has to be intrinsic to the serialization method of your CRDT. Unique IDs of any kind are really risky in CRDTs because they basically break the mathematical links between potentially partitioned duplicates. Lists make this problem even harder because you have to contend with whether you should incorporate the ordinal of each element into the hash value.

I’m working on a version control system for visual programs. You wrote a response to someone on structured edits a few months back about how the UNIX fs not being schema/structure aware is the core problem with collaborative edits and destroys open interoperability. I think you hit the nail on the head. I’ve been working on something for nearly a year now to try to address that problem. Let me know if you want to compare notes.


To answer your question 2. I don’t think you can realistically delete a key, contend with network partitions, and have a true eventually consistent data structure. I think you’re sort of running into the uncommit problem. Feels solvable but at the expense of another trade off you don’t want to make. The solution here is really in workflow. Git solves it by separating the act of committing and pushing. If you push a commit containing a secret, the only thing you can do is invalidate the secret or not push your offline branch containing the commit in the first place.


Then how about the same way as real-world situations where secrets are committed: regenerate the whole history with that string redacted. Not an ergonomic experience, but I think it’s an acceptable tradeoff and users would understand that it’s a nuclear option for a rare usecase.

Disclaimer: layman of the highest degree


I'd be interested to hear more about your version control system for visual programs.

Years back, in 2016 I had a prototype visual programming system where I built a partial Git client in JavaScript to automatically save edits to the program to GitHub. It worked pretty well even with it committing every edit (though there were some bugs).

Email is in my profile if you're interested to share.


> Let me know if you want to compare notes.

Love to. Hit me up if you like - my email address is in my profile.

I'm sometimes bad at responding to email because I'm hyperfocusing or something and falling behind on email. Sorry in advance if that happens.


> 1. We don’t want to store a hash for every keystroke a user makes.

So, why does it have to be every keystroke? We can batch them locally. A rough rule of thumb to guide the batching is typing speed. At 200 keystrokes/minute, we get an average of 3.33 keystrokes/second. If we use 3 seconds of latency for the collaborative editing to resolve concurrent edits, we get 10 keystrokes per batch (3 secs). You can also have an override where smaller batches are replicated to nodes if sufficient time has elapsed since the last local edit (e.g. 10 secs have elapsed and we have a partial batch of only 3 keystrokes).
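That batching rule might look something like this as a sketch (the thresholds are the ones from the comment, not from any real implementation):

```python
# Flush a batch once it holds BATCH_SIZE keystrokes, or once IDLE_FLUSH
# seconds pass since the last local edit.
BATCH_SIZE = 10      # ~3s of typing at 200 keystrokes/minute
IDLE_FLUSH = 10.0    # flush a partial batch after this much idle time

def flush_points(events, batch_size=BATCH_SIZE, idle=IDLE_FLUSH):
    """events: list of (timestamp, char). Returns a list of batches."""
    batches, current, last_t = [], [], None
    for t, ch in events:
        if current and last_t is not None and t - last_t >= idle:
            batches.append(current)      # idle timeout: ship the partial batch
            current = []
        current.append(ch)
        last_t = t
        if len(current) == batch_size:
            batches.append(current)      # full batch: ship it
            current = []
    if current:
        batches.append(current)          # trailing partial batch
    return batches

# 12 keystrokes typed quickly, then a long pause, then 3 more.
events = [(i * 0.3, "x") for i in range(12)] + [(30 + i, "y") for i in range(3)]
print([len(b) for b in flush_points(events)])  # [10, 2, 3]
```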

> 2. If I accidentally paste an AWS key into a document, I don’t want that key in the document history forever. How do I enable that while still getting all the benefits of the hashing system above?

In order for this to work, we should assume that every replica's current state is always obtained by starting with the empty document and then replaying the operations log. We should also permit the author of an operation to roll the operation back. So, I suppose if the collaborative editor shows an interface of all operations that originated locally, the author can select the operations that inserted the AWS key into the replica, issue a rollback op, and have it propagated to all replicas.

Since the point-in-time state of the replicas are regenerated by replaying the operations log, we would have deleted the accidentally inserted key.

In order to address the issue of the AWS key still being in the operations log, we can define the semantics of the rollback op to also include secure erasure of the previous operation from the operations log (i.e., the rollback op would turn the original operation into a no-op).

So, when all replicas apply the rollback op initiated by the author, eventually all replicas will converge to having the accidentally inserted AWS key removed from the replica and the operations log.


> So, why does it have to be every keystroke? We can batch them locally.

This approach would absolutely work. Thanks for bringing it up! The problem is that it has a cost I'd rather not pay.

The problem is that batching only works if you don't send anything until the end of the batch. And that gives you an awful tradeoff: the more "live" your editing system is, the bigger the overhead. It's really nice seeing someone type live. Seeing their changes "pop in" after 5 seconds doesn't feel great. It doesn't feel current. And I think it's too slow for pair programming over zoom and things like that. It also makes the DAG of changes much more complicated, because there's a much higher chance that changes are concurrent when there's that much latency. And that in turn makes other aspects of the system do a lot more work to merge.

With hashing alone, you don't need to batch changes like that. Instead, you can just store hashes on disk for (for example) every 100 changes, and regenerate the changes in the middle when needed. (Though you need some other tricks so you can know where to look for a given hash).
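That checkpointing idea could be sketched roughly like this, assuming a simple sequential hash chain for clarity (a real DAG would need the extra lookup tricks mentioned):

```python
import hashlib

# Hash every operation in a chain, but only persist every Nth hash.
# Intermediate hashes are regenerated on demand by replaying operations
# forward from the nearest stored checkpoint.
N = 100

def op_hash(prev_hash: bytes, op: bytes) -> bytes:
    return hashlib.sha256(prev_hash + op).digest()

def build_checkpoints(ops):
    """Store only every Nth hash."""
    h, checkpoints = b"\x00" * 32, {}
    for i, op in enumerate(ops):
        h = op_hash(h, op)
        if (i + 1) % N == 0:
            checkpoints[i] = h
    return checkpoints

def hash_at(ops, checkpoints, i):
    """Regenerate the hash of operation i from the nearest earlier checkpoint."""
    start, h = -1, b"\x00" * 32
    for j in sorted(checkpoints):
        if j <= i:
            start, h = j, checkpoints[j]
    for j in range(start + 1, i + 1):
        h = op_hash(h, ops[j])
    return h

ops = [f"op{i}".encode() for i in range(250)]
cps = build_checkpoints(ops)   # only 2 hashes stored for 250 ops

# The regenerated hash matches a full replay of the chain:
full = b"\x00" * 32
for op in ops[:176]:
    full = op_hash(full, op)
assert hash_at(ops, cps, 175) == full
```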

Respect where it's due, that would 100% solve it. I'm just looking for a way we can delete data without needing to give up the liveness.

> We should also permit the author of an operation to rollback the operation.

This doesn't completely solve the problem of hashes and guessability. Say in a 5-second batch I only type 1 character. We store some metadata, the character, and the hash of the metadata & character. Then in the next 5 seconds I issue a rollback operation to delete that character. Even if I delete the character from durable storage, I can still figure out what the character was by brute forcing candidates and checking the hash for that batch.
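The brute-force attack is easy to demonstrate (the metadata format here is made up for illustration):

```python
import hashlib
import string

# If the stored hash covers metadata plus a single typed character,
# deleting the character from disk doesn't hide it - anyone can try every
# printable character against the surviving hash.
metadata = b"agent=seph;seq=42;pos=7;"     # hypothetical batch metadata
secret_char = "k"
stored_hash = hashlib.sha256(metadata + secret_char.encode()).hexdigest()

# The character itself is now "deleted", but the hash survives:
recovered = next(
    c for c in string.printable
    if hashlib.sha256(metadata + c.encode()).hexdigest() == stored_hash
)
print(recovered)  # k
```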

I think that's fixable by adding junk data to every batch. I just bet there's a way to avoid doing that.


Are you using a rolling hash (such as the one employed in the Rabin-Karp string search algorithm)? You can then store the hash at checkpoints, and the intermediate hashes between checkpoints can be regenerated using the rolling hash.

   +[h1] +[batch1] +[h2] [batch2] -[h3] +[batch3] +[h4]
You would then have the operations log for a replica as

    +[forwardhash(null)] +[hello] +[forwardhash(hello)] +[world] -[reversehash(world)] +[ ] +[forwardhash(wor)] -[ld] +[k]
This sequence would give the following replica states after processing each hash:

   +[forwardhash(null)] +[hello] -> hello
   +[forwardhash(hello)] +[world] -> helloworld
   -[reversehash(world)] +[ ] -> hello world
    +[forwardhash(wor)] -[ld] +[k] -> hello work
The direction of the roll is just a sign bit of the hash.

We don't need to encode positions separately, as they are a function of the state of the replica and the sign of the hash.
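A possible concrete form of this idea: a Rabin-Karp-style polynomial hash that can be rolled both forward (appending text) and backward (removing a known suffix), so intermediate hashes between checkpoints can be regenerated in either direction. This is a sketch, not tied to any particular CRDT:

```python
B, M = 257, (1 << 61) - 1   # base and a Mersenne prime modulus

def roll_forward(h, text):
    for ch in text:
        h = (h * B + ord(ch)) % M
    return h

def roll_back(h, removed):
    # Inverse of roll_forward for a known removed suffix.
    inv_b = pow(B, M - 2, M)            # modular inverse of B (M is prime)
    for ch in reversed(removed):
        h = ((h - ord(ch)) * inv_b) % M
    return h

h0 = 0
h1 = roll_forward(h0, "hello")          # +[hello]
h2 = roll_forward(h1, "world")          # +[world]
h3 = roll_back(h2, "world")             # -[world]
assert h3 == h1                          # rolled back to the earlier state
assert roll_forward(h3, "world") == h2   # and forward again
```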


I like your approach to thinking about this. Seems like a plausible set of assumptions to make.


I think keystrokes was just a metaphor for whatever operation the application has to deal with.


No, the GP is right. Text CRDTs (including mine) literally generate an operation per inserted character in a collaborative text document. Diamond types solves the data explosion problem by not storing the operations separately. Instead, we internally run-length encode everything.

If you have these two operations:

    Insert "h" at position 5, ID (seph, 2)
    Insert "i" at position 6, ID (seph, 3)
... They get immediately merged internally to become:

    Insert "hi" at position 5-6, ID (seph, 2-3)
This reduces the number of items we need to process by an order of magnitude. In practice, the metadata (positions, IDs, etc) ends up taking up about 0.1 bytes on disk per inserted character, which is awesome.

This works great for (agent, seq) style IDs because the IDs can also be run-length encoded. But we can't run-length encode hashes. And I don't want my file size to grow by 32x because we store a hash after every keystroke.
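A toy version of that run-length merge (illustrative tuples, not the actual diamond-types internals):

```python
def try_merge(a, b):
    """a, b: (text, pos, agent, seq_start). Returns the merged op or None.
    Two ops merge when they come from the same agent with adjacent
    sequence numbers and adjacent positions."""
    a_text, a_pos, a_agent, a_seq = a
    b_text, b_pos, b_agent, b_seq = b
    if (a_agent == b_agent
            and b_seq == a_seq + len(a_text)
            and b_pos == a_pos + len(a_text)):
        return (a_text + b_text, a_pos, a_agent, a_seq)
    return None

def compress(ops):
    out = []
    for op in ops:
        merged = try_merge(out[-1], op) if out else None
        if merged:
            out[-1] = merged
        else:
            out.append(op)
    return out

ops = [("h", 5, "seph", 2), ("i", 6, "seph", 3)]
print(compress(ops))  # [('hi', 5, 'seph', 2)]
```

Note that an (agent, seq) run extends this way naturally, but a per-op SHA hash has no such structure to exploit, which is the crux of the storage problem.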

(I didn't invent this idea. Martin Kleppmann first used it for storing CRDTs on disk, and Kevin Jahns made the in-memory structures in Yjs use the same tricks.)


I was responding to the diamond-types CRDT author's question in the parent comment. Their github project page [1] mentions text editing a lot:

> This repository contains a high performance rust CRDT for text editing. This is a special data type which supports concurrent editing of lists or strings (text documents) by multiple users in a P2P network without needing a centralized server.

> This version of diamond types only supports plain text editing. Work is underway to add support for other JSON-style data types. See the more_types branch for details.

In any case, I agree with the metaphor and batching granular operations can always be done.

[1]: https://github.com/josephg/diamond-types


Problem 2 can be solved by "cryptographic erasure" techniques.

1. Encrypt the data with an ephemeral key.

2. Upon delete, just wipe the key. The data becomes unrecoverable.

I'm not sure about problem 1. A fast seekable stream cipher like ChaCha might help here (don't use e.g. RC4 for this), but a faster hash like BLAKE3 might be more appropriate. I'm happy to noodle over this some more. These are just my unfiltered thoughts.

If you combined the two: Each entry in the graph could have a public salt used for key derivation for the actual data. This salt would not be part of the DAG, but stored outside of it. To delete an entry, zero-out the salt and the underlying key becomes unrecoverable, which makes the data unreadable, without needing to change the history of the DAG.
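A toy sketch of that combination (illustrative only, not production crypto - a real system would use an AEAD cipher and a proper KDF): each entry's payload is encrypted with a key derived from a per-entry salt stored outside the DAG, and zeroing the salt makes the payload unrecoverable without touching the DAG's hashes.

```python
import hashlib
import secrets

def keystream(key: bytes, n: int) -> bytes:
    # Hash-counter keystream, stand-in for a real stream cipher.
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def crypt(salt: bytes, data: bytes) -> bytes:
    # XOR stream cipher: the same function encrypts and decrypts.
    key = hashlib.sha256(b"doc-master-secret" + salt).digest()  # hypothetical KDF
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

payload = b"AKIAIOSFODNN7EXAMPLE"          # the accidentally pasted key
salt = secrets.token_bytes(16)
ciphertext = crypt(salt, payload)
entry_hash = hashlib.sha256(ciphertext).hexdigest()  # what the DAG commits to

decrypted = crypt(salt, ciphertext)        # recoverable while the salt exists
erased_salt = b"\x00" * 16                 # "delete": overwrite the salt
after_erase = crypt(erased_salt, ciphertext)  # no longer yields the plaintext
```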


Yes. But diamond types stores a new entry in the DAG for every keystroke each user makes. The current scheme of IDs being (agent, sequence number) allows subsequent operations by the same user to be trivially run-length encoded. Eg if I make 1000 changes in order, we can encode that as (“seph”,1..1000). It takes much less than 1 byte of overhead on disk per character typed.

Salts per character + cryptographic erasure would solve problem 2. But how do we do it without a massive increase in storage, memory use and network bandwidth? I don’t want to generate and store a cryptographically unique salt per keystroke. (Even though it might run fine on modern hardware).


I apologize for commenting on this with almost no context about what the actual data structure looks like, but would it be possible to put the random salts "higher up in the tree"? Then whenever you want to delete some small leaf element, you actually delete a whole salted subtree and replace it with a new subtree, mostly the same but with a new salt and missing the deleted element? This might make edits more expensive, in exchange for reducing the storage overhead of the salts?


You can store a lot per typed character and never come close to the storage requirements of pictures and video.

Writing a novel is a couple million keystrokes over a year, so even storing a kilobyte per keystroke is trivial.


I'm not sure. I don't know if it's even possible, but until someone proves it isn't, I'm probably going to keep thinking about it.


I was re-reading your 'CRDTs-go-brr' blog post just today! Recently prototyping a major collaborative feature based on Automerge, and am discovering some of its performance limits on our largest data samples (25+ MB of JSON). Very very interested in JSON-like CRDTs and any progress in that area. It's a different hard problem to rich text editing, but has similarly large potential.


Do you really need to store the hashes?

I imagine in an application like this there would be a "settled" set for which all agents agree, and a "pending" set which is agreed by some but not all agents. The hash could be used solely for the pending set to get agreement on order of operations, but after that it could simply be dropped altogether. The sequence number could even be generated on the receiving agent's side.

Also, even for the "pending" set you wouldn't even need to store the hash in your primary data structure. You could simply add a dictionary from (agent id, seq) to hash, right?

The second issue seems reasonably easy to solve with this: include a salt in the hash. You're only storing the "head" hash of the "settled" set, so when the salts aren't stored there is no way to reconstruct this hash from the settled set itself.


I think I am missing something obvious, but can't you just take your existing DAG and add a merge validation step to solve this?

If I'm understanding the problem right, a malicious user uses the same source id to send action A to user 1 and action B to user 2. This starts two independent (and currently indistinguishable) branches in your DAG. When those branches come into contact at, say, user 3, weird bugs start arising.

However, user 3 is receiving a DAG of all the operations along with the operations themselves. Can't it use that to validate the DAG chain? If it finds a conflict, it could simply create a branch at the conflict point by modifying the IDs at the point of conflict - say, by appending a hash of just that operation to the id?


it’s definitely not an exact solution, but i’ve been thinking about ways prolly trees could fit into CRDT systems. they essentially give you efficient peer-to-peer entry-wise diffs, so could be good for identifying conflicts after-the-fact. but if you store operations in a prolly tree, you end up hashing them all (plus logarithmic additional space overhead), so it might be a non-starter.

i have a blog post to shill if you haven’t heard of them: https://joelgustafson.com/posts/2023-05-04/merklizing-the-ke...


1. Can you reliably merge inserts of characters into inserts of strings, given both the original inserts and the current document history? E.g. if there is a closed set of history whose inserts fall only in the middle of the subset of inserts you want to turn into a string, those inserts can be permanently merged into a string insert.

2. String merging should also allow null merges: e.g. identify the same closed set of history inserts that refer only to the inside of the original inserts, so that permanently deleting the history of characters in the document could be replaced with a null insert that can still be referred to by other inserts/merges.


Would (1) hit problems because you may store them coalesced eventually, but you sent them to the other clients as individual events, where they may get interleaved with events from that client?

I think it might mess with fine-grained undo, also?

(idk much about CRDTs)

"Reject re-used sequence numbers" might be simpler?


I know you said you didn't like the idea of hashes, so this might be a non-starter for your use case.

But it sounds like a V5 GUID [0] might meet some of the needs for your "operation IDs". The existing "random agent id" and "sequence number" could be used as part of the namespace / salt.

[0] https://www.sohamkamani.com/uuid-versions-explained/#v5-non-...


I don’t have a problem with hashes generally. Correct me if I’m wrong, but this scheme sounds like hashing with extra steps?

I agree this could be done. But diamond types (my library) considers every keystroke to be a new change. If we store a UUID for every keystroke, that’s very inefficient.


Sounds like inverted hash chains. Generate a hash chain h_i = sha(h_i-1) for a random h_0 up to h_N. Each message m_i becomes (agent_id, state, hmac(h_(N-i+1), state), h_(N-i)). Upon receipt of message m_(i+1) the receiver can verify the hmac of the preceding state and that the key is the preimage of the earlier message. By repeated hashing any gaps in the sequence can be fast-forwarded through and still verified. This provides for deletion without a loss of verifiability up to length N.
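A rough sketch of this scheme (indices as in the comment; h_0 is fixed here rather than random, and the chain is short for illustration):

```python
import hashlib
import hmac

def sha(b):
    return hashlib.sha256(b).digest()

# Precompute h_i = sha256(h_{i-1}) and publish h_N as a commitment.
N = 16
chain = [b"\x01" * 32]                # h_0 (random in practice)
for _ in range(N):
    chain.append(sha(chain[-1]))      # chain[i] == h_i

def make_msg(i, state):
    # m_i = (state, hmac(h_{N-i+1}, state), h_{N-i})
    key = chain[N - i + 1]
    return (state, hmac.new(key, state, hashlib.sha256).digest(), chain[N - i])

def verify(msg, last_known_hash):
    """last_known_hash: the most recently verified reveal (h_N initially).
    Gaps in the sequence are fast-forwarded by repeated hashing."""
    state, tag, reveal = msg
    h, steps = reveal, 0
    while h != last_known_hash and steps <= N:
        h, steps = sha(h), steps + 1
    if h != last_known_hash or steps == 0:
        return False
    key = sha(reveal)                 # h_{N-i+1} is the hash of the reveal
    return hmac.compare_digest(tag, hmac.new(key, state, hashlib.sha256).digest())

commitment = chain[N]
m1 = make_msg(1, b"insert 'h'")
m3 = make_msg(3, b"insert 'o'")       # message 2 can be deleted entirely
assert verify(m1, commitment)
assert verify(m3, m1[2])              # verified across the gap via m1's reveal
```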


I was thinking of something like this but where you only send the whole shebang like you mention every now and then, I'll call that a "sync point".

Between the sync points, each message only contains part of the hmac, proportional to the state size, and not h_(N-i). I.e. if sending a single keystroke, then a single byte of the hmac; if sending a few bytes, then two bytes of hmac, etc.

If the sequence is continuous you can still verify as you can regenerate h_(N-i) from the previous sync point. A deletion can then still be done by issuing a redact operation which deletes everything from the start of the secret up to the next sync point, and adding an operation which inserts back the data between the end of the secret to the next sync point.

My thought was that while finding a collision for a single keystroke message might not be hard, doing it for two messages in a row should be quite difficult. However it would "just" roughly double the size of the messages on average, rather than add dozens of bytes to each.

I was also thinking that h_0 should be based on data that includes something that's not up to the individual actor, to make preimage attacks harder.

Anyway, way past bedtime and not my area at all so this might all be drivel, but interesting problem to ponder...


How familiar are you with RON? I don’t think it supports “forgetting” operations but it does a good job of eliding sequential op IDs and hashes. https://replicated.cc/


Troublesome user? As in a malicious user? Are you trying to lock down your attack surface against script kiddies or is there another instance where this situation comes up?


> or is there another instance where this situation comes up?

Yes, if you are a Byzantine general trying to coordinate an attack, and you aren't sure of your peers' true allegiance.


Yes, a malicious user. I’ve also seen buggy software try to reuse IDs - which causes similar mayhem. It would be better if we could remove this entire class of problems.


It could mean that, as in "someone who causes trouble" or it could be a result of a bug which only affects a handful of users who suddenly "have some troubles".


Use the hash of the hash instead of the hash? The hash is not secret so never needs to be deleted.

Are merkle trees relevant here?


1. Bloom for existence, re-create via brute-force search (?), checkpoint hash every $N seconds(?)

2. Godbolt => "This rendered text came from these portions of the DAG/tree, press CTRL-Del to irrecoverably remove them from the DAG and/or replace them with 'REDACTED'".

#1 is "simply" related to efficiency. You can lean on probabilistic data structures to see "yeah, probably this was HERE in the history tree", and then perhaps replay the "N" seconds checkpoints with the original input data to validate or match exactly where the hash is/was.

#2 is the same problem Git has, where you need to rewrite / filter / "detached head" in order to obliterate something from history, and is somewhat a local/remote UI issue. "A request to obliterate history is being built in parallel over here => click here to adopt it, and diff/transfer only your new work to it."

Related to #2 I've had a thought of a social network protocol, and how to "delete malicious nudes" or "oops, I accidentally published my SSN and Tax Filing *.pdf to my friends". On the "real" open internet, google or malicious actors would/could store that info forever. However, in a semi-closed group (eg: 100-1000 "friends of friends") that are "all" running participating/conforming clients, a "request-to-delete" or "request-to-disavow" is particularly valuable.

My imagined protocol is: "post001.txt" => "$GOOD_SHA", "nudes.jpg" => "$WHOOPS_SHA", "taxes.pdf" => "$YIKES_SHA", "bait.txt" => "$BAIT_SHA".

"Of course" everything is content-addressable, ie: anyone can fetch "post001.txt" from anyone as "$GOOD_SHA", however, "it would be nice" if I as a participant in the network could query "how_many( $GOOD_SHA );" and see distribution across the network participants. "disavow( $WHOOPS_SHA, $YIKES_SHA ) && how_many( $WHOOPS_SHA, $YIKES_SHA )" => ie: in a "good" network, after a "disavow(...)" then nodes should not respond with that availability. Occasionally throw in "$BAIT_SHA w/ disavow(...)" and make sure that no one is propagating it.

Similar to "key revocation lists", you could eliminate/reject any clients which do not respect requests of non-propagation for $BAIT_SHA or anything that is "disavow(...)'d". Same story with anyone hosting https://en.wikipedia.org/wiki/Celebrity_sex_tape ... I don't really want them in my network, so query for particular content, and if they respond, I can nudge/boot/disengage from users that are hosting them (same with eg: "I love my AK-47 and assault rifle" political memes, disengage / dislike / spread apart people sharing/hosting/propagating content that I don't want).

While you're still left with malicious / non-conforming nodes potentially archiving everything for all eternity, there is a relatively immediate ability to fracture / exclude (shun) nodes that are detected as non-conforming, so "your stuff" is still out there, it's just relatively inaccessible, and you have to be "careful" to have your client be up-to-date w.r.t. revocation actions otherwise you'll be excluded from networks.

Meh. Kind of complicated, but those seem to be in the range of characteristics that you're talking about as well.




