Our goal is to make it easier to write code that handles failures: failed outbound API calls, infrastructure issues like a host dying, problems talking between services. The primitive we offer is a guarantee that your handlers always run to completion (whether to a result or to a terminal error).
The way we do that is by writing down what your code is doing, while it's doing it, to a store. Then, on any failure, we re-execute your code, filling in any previously stored results so that it can 'zoom' back to the point where it failed and continue. It's like a much more efficient and intelligent retry, where the code doesn't have to be idempotent.
Is that true? I don't think that makes any theoretical sense, since I'm pretty sure the whole thing relies on transparent retries for external calls.
If I complete some action that can't be retried and then die before writing it to the log (completing an action non-atomically), there would seem to be no way for this to recover without idempotency.
Absolutely, individual atomic side effects need to be idempotent. We can't solve the fundamental distributed-systems problem there (e.g. an HTTP 500: did the request actually get executed?).
However, the string of operations doesn't need to be idempotent. Let's say your handler does three tasks, A, B, and C, and the machine dies at C. Only C will be re-executed. A and B need to be atomically idempotent, but once we move on, we don't start again.
Critical point: it's much easier to think about and test for the re-execution of C in a vacuum than to test for A, B, and C all re-executing in sequence, with a variable number of them having already executed before.
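A minimal sketch of the behavior described above, using a hypothetical `step` helper and an in-memory dict as the journal (this is an illustration of the journaling/replay idea, not Restate's actual API): the first attempt completes A and B, "dies" during C, and on retry only C actually runs again.

```python
# Hypothetical journaled-step helper: results are persisted before the next
# step starts; on re-execution, completed steps replay their stored result.
journal = {}     # a durable store in a real system; an in-memory dict here
executions = []  # tracks which tasks actually ran (vs. were replayed)

def step(name, fn):
    if name in journal:
        return journal[name]   # replayed: the side effect is NOT repeated
    result = fn()
    journal[name] = result     # persisted before moving on
    return result

def task(name, fail=False):
    executions.append(name)
    if fail:
        raise RuntimeError(f"crash during {name}")
    return f"{name}-ok"

# First attempt: A and B complete, the process "dies" during C.
try:
    step("A", lambda: task("A"))
    step("B", lambda: task("B"))
    step("C", lambda: task("C", fail=True))
except RuntimeError:
    pass

# Retry: A and B replay from the journal; only C actually executes again.
step("A", lambda: task("A"))
step("B", lambda: task("B"))
step("C", lambda: task("C"))

print(executions)  # ['A', 'B', 'C', 'C'] - A and B never re-executed
```

So the only re-execution you have to reason about is C running a second time, in isolation.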
Doesn't anything involving requests to other services inherently have to be idempotent because there's still a chance of a communication error resulting in an unknown outcome of the action? You don't know if the "widget order" was successfully placed or not, and therefore there's no way to know if that action can safely be tried again.
That is true, individual steps should be idempotent or undo-able.
But really, only each individual step needs to be idempotent, rather than the full sequence, and that makes many situations much easier.
For example, you create a new permissions role and assign it to the user (two steps). If you safely memoize the result of the first step (say, the role uid), then any retry just assigns the same role to the user again (which makes no difference). Without memoizing the step, you might retry the whole process and assign two roles, or write a lot of code to try and figure out what was created before and reconnect the pieces.
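The two-step example could be sketched like this, again with a hypothetical `step` helper and stand-in `create_role`/assignment logic (all names here are illustrative assumptions, not a real API). A crash is simulated between the two steps; because the role uid was journaled, the retry replays it instead of creating a second role:

```python
import uuid

journal = {}        # durable store in a real system
roles_created = []  # every role actually created
assignments = []    # (user, role_uid) pairs actually assigned
attempt = {"n": 0}

def step(name, fn):
    if name in journal:
        return journal[name]   # replay the memoized result
    result = fn()
    journal[name] = result
    return result

def create_role():
    uid = str(uuid.uuid4())    # generated id, memoized via the journal
    roles_created.append(uid)
    return uid

def handler(user):
    role_uid = step("create_role", create_role)
    def assign():
        if attempt["n"] == 0:  # simulate a crash on the first attempt
            attempt["n"] += 1
            raise RuntimeError("process died before assignment")
        assignments.append((user, role_uid))
    step("assign_role", assign)

try:
    handler("alice")   # crashes mid-way; the role uid is already journaled
except RuntimeError:
    pass
handler("alice")       # retry: same uid replayed, role assigned exactly once

print(len(roles_created), len(assignments))  # 1 1
```

Without the memoized uid, the retry would call `create_role` again and leave an orphaned second role behind.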
You can also use this to memoize generated ids, dry-run before a change, ensure undos run to completion (saga style), even implement 2PC patterns if you want to.
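The saga-style point can be sketched with the same hypothetical `step` helper: when a later step fails terminally, the recorded undo actions are themselves run as journaled steps, so the compensations also survive crashes and replays (simplified single-run illustration; a real system would journal the undo registrations too):

```python
journal = {}
log = []    # observable side effects
undos = []  # compensations recorded as we go

def step(name, fn):
    if name in journal:
        return journal[name]
    result = fn()
    journal[name] = result
    return result

def reserve(item):
    log.append(f"reserve {item}")
    # record the compensation for this reservation
    undos.append(lambda item=item: log.append(f"release {item}"))

def charge_card():
    raise RuntimeError("card declined")  # a terminal, non-retryable error

step("reserve-hotel", lambda: reserve("hotel"))
step("reserve-flight", lambda: reserve("flight"))
try:
    step("charge-card", charge_card)
except RuntimeError:
    # terminal error: run compensations in reverse order, as durable steps,
    # so they too run to completion even across failures
    for i, undo in enumerate(reversed(undos)):
        step(f"undo-{i}", undo)

print(log)  # ['reserve hotel', 'reserve flight', 'release flight', 'release hotel']
```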
https://news.ycombinator.com/item?id=40659968
Absolutely, sorry if I'm not tight enough with my language. Maybe it should be described as 'operation idempotency' vs 'handler idempotency'. IMO, an entire handler re-executing is much harder to reason about and test for than a particular operation re-executing individually, with nothing else changing between executions.
A special case is when the operation calls another Restate service. In this case, Restate makes sure the callee is executed exactly once, and there is no need for the user to pass an idempotency key or anything similar. Only when interacting with the external world from a Restate service does the operation need to be idempotent.
Ah, I see what you mean. In this case the handler should complete with a terminal error: we weren't able to finish the task in time. Of course, many kinds of errors and timeouts are valid application-level results, not transient infrastructure issues. And sadly, tight timeouts push transient issues into application-level issues; this is unavoidable, I think.
So non-response-time-bound workloads that need to reliably dispatch other processes to completion?
Would a good example be something like automated highway toll collection? I.e., I drive past a scanner on the highway, my license plate is scanned, and several state-bound collection events need to be triggered until the toll is ultimately collected?
Yes, definitely, but we can also cover response-time-bound tasks, not just async ones! The typical p90 of a 3-step workflow is 50ms. Our goal is to run on every RPC, anywhere you need reliability.
What if the code has changed by the time it is retried? I imagine it would have to throw away its memoized instructions, and because the code isn't idempotent…