how do tools like this handle evolving workflows? e.g., if I have a "durable workflow" that sleeps for a month and then performs its next actions, what do I do if I need to change the workflow during that month? I really like the concept, but this seems like an issue for anything except fairly short workflows. If I keep my data and algorithms separate, I can modify my event-handling code while workflows are "active."
1. Immutable code platforms (like Lambda) make things much more tractable - old code being executable for 'as long as your handlers run' is the property you need. This can also be achieved in Kubernetes with some clever controllers
2. The ability to make delayed RPCs and span time that way lets you keep your handlers very short-running while taking action over very long periods. This is far superior to just sleeping over and over in a loop - instead, you do delayed tail calls.
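The delayed-tail-call pattern described above can be sketched with a toy in-memory scheduler (all names here are hypothetical - this is not the Restate API, just an illustration of the shape): each invocation runs briefly, and instead of sleeping it persists a small (handler, args, due-time) record and exits.

```typescript
// Toy scheduler illustrating delayed tail calls. Only a tiny request
// record is "durable" between steps - no journal of a half-finished,
// long-running function.

type Invocation = { handler: string; args: unknown; dueAt: number };

class ToyScheduler {
  private queue: Invocation[] = [];
  private handlers = new Map<string, (s: ToyScheduler, args: any) => void>();
  now = 0; // simulated clock, in days

  register(name: string, fn: (s: ToyScheduler, args: any) => void) {
    this.handlers.set(name, fn);
  }

  // The "delayed call": persist (handler name, args, due time) and return.
  sendDelayed(handler: string, args: unknown, delayDays: number) {
    this.queue.push({ handler, args, dueAt: this.now + delayDays });
  }

  // Advance the simulated clock and run whatever has come due.
  advance(days: number) {
    this.now += days;
    const due = this.queue.filter((i) => i.dueAt <= this.now);
    this.queue = this.queue.filter((i) => i.dueAt > this.now);
    for (const inv of due) this.handlers.get(inv.handler)!(this, inv.args);
  }
}

// A reminder flow that acts once a month for three months, without any
// single handler execution lasting longer than one short step.
const log: string[] = [];
const sched = new ToyScheduler();
sched.register("sendReminder", (s, { remaining }: { remaining: number }) => {
  log.push(`reminder sent on day ${s.now}`);
  if (remaining > 1) {
    // Delayed tail call: re-invoke myself in a month, then finish now.
    s.sendDelayed("sendReminder", { remaining: remaining - 1 }, 30);
  }
});

sched.sendDelayed("sendReminder", { remaining: 3 }, 30);
sched.advance(30); // month 1
sched.advance(30); // month 2
sched.advance(30); // month 3
```

Because the durable artifact is only the pending request, redeploying the handler between months changes which code the next step runs - which is exactly the updatability property being discussed.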
> Immutable code platforms (like Lambda) make things much more tractable
My job is admittedly very old-school, but is that actually doable? I don't think my stakeholders would accept a version of "well, we can't fix this bug for our current customers, but the new ones won't have it". That just seems like chaos nobody wants to deal with.
I don't personally believe this immutability property should be used for handlers that run for more than, say, 5 minutes. Any longer than that, I'd suggest using delayed calls, which explicitly serialise the handler arguments instead of saving the whole journal. I agree that executing code that is even just an hour old is unacceptable in almost all cases.
Obviously you can still sleep for a month, but I really see no way to make such a handler safely updatable without editing the code to branch on versions, which can become a mess really quickly (but it's good for getting out of a jam!)
ah! this took me a second to grok, but from #2 above: "we just want to send the email service a request that we want to be processed in a month. The thing that hangs around ‘in-flight’ wouldn’t be a journal of a partially-completed workflow, with potentially many steps, but instead a single request message."
I'll have to think through how much that solves, but it's a new insight for me - thanks!
I like that you're working on this. seems tricky, but figuring out how to clearly write workflows using this pattern could tame a lot of complexity.
It's always been a lively topic within Restate. The conversation goes a bit like this
> Let users write code how they want, it's our job to make it work!
> Yes, but it's simply not safe to do this!
I think we need to offer our users a lot of stuff to get it right:
1. Tools so they know when a deploy puts in-flight invocations at risk - maybe even editor integration showing which invocations exist at each line of a handler
2. Nudge towards delayed-call patterns wherever we can
3. Escape hatches if they absolutely have to change a long-running handler - ways to branch their code on the running version, clever cancellation tricks, 'restart as a new call' operation
Sadly no silver bullet. Delayed calls get you a lot of the way though :p
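The "branch on the running version" escape hatch from point 3 can be sketched like this (all names hypothetical - a tiny stand-in for a real journaling runtime, not Restate's API): the handler durably records the code version it started under as its first step, and later logic branches on that recorded value, so replays of old invocations keep the old behaviour after a redeploy.

```typescript
// Sketch: version pinning via the journal itself.
const CURRENT_VERSION = 2;

// A durable step helper: on replay, return the recorded value;
// otherwise compute it, record it, and return it.
function step<T>(
  journal: Record<string, unknown>,
  key: string,
  compute: () => T
): T {
  if (key in journal) return journal[key] as T;
  const v = compute();
  journal[key] = v;
  return v;
}

function handler(journal: Record<string, unknown>): string {
  // Pin the version on first execution; replays see the pinned value,
  // not CURRENT_VERSION, so old invocations follow their old branch.
  const version = step(journal, "version", () => CURRENT_VERSION);
  if (version >= 2) {
    return step(journal, "result", () => "new flow");
  }
  return step(journal, "result", () => "old flow");
}
```

This keeps old and new invocations correct within one codebase, at the cost of the branching mess mentioned above - every behavioural change adds another `if (version >= n)`.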
My org solved this problem for our use case (handling travel bookings) by versioning workflow runs. Most of our runs are very short-lived, but there are cases where we have a run that lasts for days because of some long-running polling process, e.g. waiting on a human to perform some kind of action.
If we deploy a new version of the workflow, we just keep around the existing deployed version until all of its in-flight runs are completed. Usually this can be done within a few minutes but sometimes we need to wait days.
We don't actually tie service releases 1:1 with the workflow versions just in case we need a hotfix for a given workflow version, but the general pattern has worked very well for our use cases.
Yeah, this is pretty much exactly how we propose it's done (Restate services are inherently versioned; you can register new code as a new version and old invocations will go to the old version).
The only caveat is that we generally recommend you keep it to just a few minutes, and use delayed calls and our state primitives to have effects that span longer than that. E.g., to poll repeatedly, a handler can delayed-call itself over and over, and to wait for a human, we have awakeables (https://docs.restate.dev/develop/ts/awakeables/)
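The awakeable idea - waiting on a human without a long-running handler - can be illustrated with a toy version (hypothetical names, not the real Restate awakeables API): the flow creates an id plus a promise, hands the id to the outside world, and suspends until something external resolves it.

```typescript
// Toy awakeables: an id the outside world can use to wake a flow up.
class ToyAwakeables {
  private pending = new Map<string, (value: string) => void>();
  private counter = 0;

  create(): { id: string; promise: Promise<string> } {
    const id = `awakeable-${++this.counter}`;
    const promise = new Promise<string>((resolve) =>
      this.pending.set(id, resolve)
    );
    return { id, promise };
  }

  // Called from "outside" (webhook, human action) to resume the flow.
  resolve(id: string, value: string) {
    this.pending.get(id)?.(value);
    this.pending.delete(id);
  }
}

async function approvalFlow(
  awk: ToyAwakeables,
  notify: (id: string) => void
): Promise<string> {
  const { id, promise } = awk.create();
  notify(id); // e.g. email the human a link containing the id
  const decision = await promise; // suspends until resolved externally
  return `booking ${decision}`;
}

// Usage: the "human" resolves the awakeable after the flow has suspended.
const awk = new ToyAwakeables();
let capturedId = "";
const result = approvalFlow(awk, (id) => { capturedId = id; });
awk.resolve(capturedId, "approved"); // the human clicked the link
```

In a real system the promise would be backed by durable state rather than process memory, so the wait can outlive any single deployment.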
Conceptually, I think the only thing these tools add to the mental model of separating data and logic is that they also store the name of the next routine to call. The name is late-bound, so migration would amount to switching out the implementation of that procedure.
Restate also stores a deployment version along with other invocation metadata. FaaS platforms like AWS Lambda make it very easy to retain old versions of your code, and Restate will complete a started invocation with the handlers that it started with. This way, you can "drain" older executions while new incoming requests are routed to the latest version.
You still have to ensure that all versions of handler code that may potentially be activated are fully compatible with all persisted state they may be expected to access, but that's not much different from handling rolling deployments in a large system.
not necessarily - we store the intermediary states of your handler, so it can be replayed on infrastructure failures. if the handler changes in what it does, those intermediary states (the 'journal') might no longer match this. the best solution is to route replayed requests to the version of the code that originally executed the request, but:
1. many infra platforms don't allow you to execute previous versions
2. after some duration (maybe just minutes), executing old code is dangerous, e.g. because of insecure dependencies.
I was of course just thinking about the "front" of the execution, when you're sleeping for 2 days and you want to switch out a future step. Switching out logic that has already been committed is a harder problem. That's a good point.
> after some duration (maybe just minutes), executing old code is dangerous, eg because of insecure dependencies.
Could you elaborate on that? My understanding is that all of this tech builds on actions being retried in an "eventually consistent" manner. That would seem to clash with this argument.
What I mean is that executing a software artifact from, let's say, a month ago, just to get month-old business logic, is extremely dangerous because of non-business-logic elements. Maybe it uses an old DB connection string, or a library with a CVE. It's a 'hack' to execute old code versions in order to get the business logic that a request originally ran on - a hack that I feel should be used for minutes, not even hours.
> I was of course just thinking about the "front" of the execution, when you're sleeping for 2 days and you want to switch out a future step. Switching out logic that has already been committed is a harder problem. That's a good point.
You make a good point - this is the idea behind 'delayed calls', which are really one of my favourite things about Restate. Don't save all the intermediary state - just serialise the service name, the handler name, and the arguments, and store that for a month or whatever. That is a very tractable problem - i.e. just request-object versioning.
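To make the request-object-versioning point concrete, here's a sketch (the field names and the v1→v2 change are made up for illustration): the only durable state of a delayed call is a small versioned record, so upgrading in-flight work becomes ordinary data migration rather than replaying a journal against changed code.

```typescript
// The entire durable footprint of a delayed call: one small record.
interface DelayedCall {
  schemaVersion: number; // version the *request*, not the code
  service: string;
  handler: string;
  args: unknown;
  executeAt: string; // ISO timestamp
}

// Upgrading in-flight requests is a plain data migration.
function migrate(call: DelayedCall): DelayedCall {
  if (call.schemaVersion === 1) {
    // Hypothetical change: v1 stored a bare email string,
    // v2 wraps it in an object with a `to` field.
    return { ...call, schemaVersion: 2, args: { to: call.args as string } };
  }
  return call;
}

const persisted: DelayedCall = {
  schemaVersion: 1,
  service: "EmailService",
  handler: "sendFollowUp",
  args: "user@example.com",
  executeAt: "2025-02-01T00:00:00Z",
};
const upgraded = migrate(persisted);
```

Contrast this with a journal of a partially-completed workflow: here there is nothing to replay, so the handler code can change freely during the month the request sits in flight.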