Our goal is to make it easier to write code that handles failures: failed outbound API calls, infrastructure issues like a host dying, problems talking between services. The primitive we offer is a guarantee that your handlers always run to completion (whether to a result or to a terminal error).
The way we do that is by writing down what your code is doing, while it's doing it, to a store. Then, on any failure, we re-execute your code, filling in any previously stored results so that it can 'zoom' back to the point where it failed and continue. It's like a much more efficient and intelligent retry, where the code doesn't have to be idempotent.
Is that true? I don't think that makes any theoretical sense, since I'm pretty sure the whole thing relies on transparent retries for external calls.
If I complete some action that can't be retried and then die before writing it to the log (completing an action non-atomically), there would seem to be no way for this to recover without idempotency.
Absolutely, individual atomic side effects need to be idempotent. We can't solve the fundamental distributed-systems problem there (e.g. an HTTP 500: did the request actually get executed?).
However, the string of operations doesn't need to be idempotent. Let's say your handler does three tasks, A, B, and C, and the machine dies at C. Only C will be re-executed. A and B need to be atomically idempotent, but once we move on, we don't start again.
Critical point: it's much easier to think about and test for the re-execution of C in a vacuum than to test for A, B, and C all re-executing in sequence, with a variable number of them having already executed before.
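A minimal sketch of the behavior described above, using a hypothetical `step` helper and an in-memory dict as the journal (this is an illustration of the journaling/replay idea, not Restate's actual API): the first attempt completes A and B, "dies" during C, and on retry only C actually runs again.

```python
# Hypothetical journaled-step helper: results are persisted before the next
# step starts; on re-execution, completed steps replay their stored result.
journal = {}     # a durable store in a real system; an in-memory dict here
executions = []  # tracks which tasks actually ran (vs. were replayed)

def step(name, fn):
    if name in journal:
        return journal[name]   # replayed: the side effect is NOT repeated
    result = fn()
    journal[name] = result     # persisted before moving on
    return result

def task(name, fail=False):
    executions.append(name)
    if fail:
        raise RuntimeError(f"crash during {name}")
    return f"{name}-ok"

# First attempt: A and B complete, the process "dies" during C.
try:
    step("A", lambda: task("A"))
    step("B", lambda: task("B"))
    step("C", lambda: task("C", fail=True))
except RuntimeError:
    pass

# Retry: A and B replay from the journal; only C actually executes again.
step("A", lambda: task("A"))
step("B", lambda: task("B"))
step("C", lambda: task("C"))

print(executions)  # ['A', 'B', 'C', 'C'] - A and B never re-executed
```

So the only re-execution you have to reason about is C running a second time, in isolation.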
Doesn't anything involving requests to other services inherently have to be idempotent because there's still a chance of a communication error resulting in an unknown outcome of the action? You don't know if the "widget order" was successfully placed or not, and therefore there's no way to know if that action can safely be tried again.
That is true, individual steps should be idempotent or undo-able.
But really, only each individual step needs to be idempotent, rather than the full sequence, and that makes many situations much easier.
For example, you create a new permissions role and assign it to the user (two steps). If you safely memoize the result of the first step (say, the role uid), then any retry just assigns the same role to the user again (which makes no difference). Without memoizing the step, you might retry the whole process and assign two roles, or write a lot of code to try and figure out what was created before and reconnect the pieces.
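The two-step example could be sketched like this, again with a hypothetical `step` helper and stand-in `create_role`/assignment logic (all names here are illustrative assumptions, not a real API). A crash is simulated between the two steps; because the role uid was journaled, the retry replays it instead of creating a second role:

```python
import uuid

journal = {}        # durable store in a real system
roles_created = []  # every role actually created
assignments = []    # (user, role_uid) pairs actually assigned
attempt = {"n": 0}

def step(name, fn):
    if name in journal:
        return journal[name]   # replay the memoized result
    result = fn()
    journal[name] = result
    return result

def create_role():
    uid = str(uuid.uuid4())    # generated id, memoized via the journal
    roles_created.append(uid)
    return uid

def handler(user):
    role_uid = step("create_role", create_role)
    def assign():
        if attempt["n"] == 0:  # simulate a crash on the first attempt
            attempt["n"] += 1
            raise RuntimeError("process died before assignment")
        assignments.append((user, role_uid))
    step("assign_role", assign)

try:
    handler("alice")   # crashes mid-way; the role uid is already journaled
except RuntimeError:
    pass
handler("alice")       # retry: same uid replayed, role assigned exactly once

print(len(roles_created), len(assignments))  # 1 1
```

Without the memoized uid, the retry would call `create_role` again and leave an orphaned second role behind.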
You can also use this to memoize generated ids, dry-run before a change, ensure undos run to completion (saga style), even implement 2PC patterns if you want to.
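The saga-style point can be sketched with the same hypothetical `step` helper: when a later step fails terminally, the recorded undo actions are themselves run as journaled steps, so the compensations also survive crashes and replays (simplified single-run illustration; a real system would journal the undo registrations too):

```python
journal = {}
log = []    # observable side effects
undos = []  # compensations recorded as we go

def step(name, fn):
    if name in journal:
        return journal[name]
    result = fn()
    journal[name] = result
    return result

def reserve(item):
    log.append(f"reserve {item}")
    # record the compensation for this reservation
    undos.append(lambda item=item: log.append(f"release {item}"))

def charge_card():
    raise RuntimeError("card declined")  # a terminal, non-retryable error

step("reserve-hotel", lambda: reserve("hotel"))
step("reserve-flight", lambda: reserve("flight"))
try:
    step("charge-card", charge_card)
except RuntimeError:
    # terminal error: run compensations in reverse order, as durable steps,
    # so they too run to completion even across failures
    for i, undo in enumerate(reversed(undos)):
        step(f"undo-{i}", undo)

print(log)  # ['reserve hotel', 'reserve flight', 'release flight', 'release hotel']
```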
https://news.ycombinator.com/item?id=40659968
Absolutely, sorry if I'm not tight enough with my language. Maybe it should be described as 'operation idempotency' vs 'handler idempotency'. IMO, an entire handler re-executing is much harder to reason about and test for than a particular operation re-executing individually, with nothing else changing between executions.
A special case is when the operation calls another Restate service. In this case, Restate makes sure the callee is executed exactly once, and there is no need for the user to pass an idempotency key or anything similar. Only when interacting with the external world from a Restate service does the operation need to be idempotent.
Ah, I see what you mean. In this case the handler should complete with a terminal error: we weren't able to finish the task in time. Of course, many kinds of errors and timeouts are valid application-level results, not transient infrastructure issues. And sadly, tight timeouts push transient issues into application-level issues; this is unavoidable, I think.
So non-response-time-bound workloads that need to reliably dispatch other processes to completion?
Would a good example be something like automated highway toll collection? I.e., I drive past a scanner on the highway, my license plate is scanned, and several state-bound collection events need to be triggered until the toll is ultimately collected?
Yes, definitely, but we can also cover response-time-bound tasks, not just async ones! The typical p90 of a 3-step workflow is 50ms. Our goal is to run on every RPC, anywhere you need reliability.
What if the code has changed by the time it is retried? I imagine it would have to throw away its memoized instructions, and because the code isn't idempotent…