I worked at AWS (EC2 specifically), and the comment is accurate.
Engineers own their alarms, which they set up themselves during working hours. An engineer on call carries a "pager" for a given system they own as part of a small team. If your own alert rules get tripped, you will be automatically paged regardless of the time of day. There are a variety of mechanisms to prioritize and delay issues until business hours, and to suppress alarms based on various conditions - e.g. the health of your own dependencies.
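To make the suppression point concrete, here is a minimal sketch using the public CloudWatch composite-alarm API rather than whatever internal tooling is actually involved; the alarm names, region, and topic ARN are made up.

```ts
// Sketch: a composite alarm that only pages when our own error alarm fires
// while the dependency's health alarm is not already firing.
// All names and ARNs below are illustrative.
import {
  CloudWatchClient,
  PutCompositeAlarmCommand,
} from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "us-east-1" });

await cw.send(
  new PutCompositeAlarmCommand({
    AlarmName: "MyServiceErrors-PageOnCall",
    // Don't page us for an outage that is really a dependency's outage.
    AlarmRule: "ALARM(MyServiceHighErrorRate) AND NOT ALARM(DependencyUnhealthy)",
    // SNS topic that feeds the paging system.
    AlarmActions: ["arn:aws:sns:us-east-1:123456789012:oncall-page"],
  })
);
```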
End user tickets cannot page engineers, but fellow internal teams can. Escalating and paging in additional help when you can't handle a situation alone is generally encouraged, and many tenured/senior engineers are very keen to help, even at weird hours.
I hope @sewen will expand on this, but from the blog post he wrote to announce Restate to the world back in August '23:
> Stateful Functions (in Apache Flink): Our thoughts started a while back, and our early experiments created StateFun. These thoughts and ideas then grew to be much much more now, resulting in Restate. Of course, you can still recognize some of the StateFun roots in Restate.
Restate also stores a deployment version along with other invocation metadata. FaaS platforms like AWS Lambda make it very easy to retain old versions of your code, and Restate will complete an already-started invocation against the deployment version it started with. This way, you can "drain" older executions while new incoming requests are routed to the latest version.
You still have to ensure that all versions of handler code that may potentially be activated are fully compatible with all persisted state they may be expected to access, but that's not much different from handling rolling deployments in a large system.
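To make that concrete, here is a small hypothetical sketch (the types and field names are invented) of keeping handler code compatible with state written by an older version: new fields are only ever added as optional and get a default on read.

```ts
// Hypothetical persisted state for an order handler.
// v1 wrote { orderId, amount }; v2 only adds an optional field.
interface OrderState {
  orderId: string;
  amount: number;
  currency?: string; // added in v2; state written by v1 won't have it
}

// Default the new field on read so v2 handlers can process v1 state.
function normalize(state: OrderState): Required<OrderState> {
  return { ...state, currency: state.currency ?? "USD" };
}
```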
Disclaimer: I work on Restate together with @p10jkle.
You can absolutely do something similar with an RDBMS.
I tend to think of building services as state machines: every important step is tracked somewhere safe and causes a transition through the state machine. If you're doing this by hand, you reach out to a DBMS and explicitly checkpoint your state whenever something important happens.
To achieve idempotency, you end up peppering your code with prepare-commit style steps where you first read the stored state and decide, at each logical step, whether you're resuming a prior partial execution or starting fresh. This gets old very quickly, so most code ends up relying on maybe a single idempotency check at the start, plus caller retries. You would also need an external task queue or a sweeper of some sort to pick up and redrive partially completed executions.
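As a rough illustration of the hand-rolled approach, here is a sketch using node-postgres and an invented execution_steps table: each step checks the table before doing its work and checkpoints the transition afterwards.

```ts
// Sketch of manual checkpointing: each step consults a state table before
// doing work, so a retry resumes rather than repeating side effects.
// Assumes a table execution_steps with a unique key on (execution_id, step).
import { Client } from "pg";

const db = new Client(); // connection settings come from environment variables
await db.connect();

async function runStep(
  executionId: string,
  step: string,
  work: () => Promise<void>
): Promise<void> {
  // Have we already completed this step for this execution?
  const done = await db.query(
    "SELECT 1 FROM execution_steps WHERE execution_id = $1 AND step = $2",
    [executionId, step]
  );
  if (done.rows.length > 0) return; // resuming: skip the work

  await work(); // the side effect itself still needs to tolerate a rare rerun

  // Checkpoint the transition so a later retry skips this step.
  await db.query(
    "INSERT INTO execution_steps (execution_id, step) VALUES ($1, $2) ON CONFLICT DO NOTHING",
    [executionId, step]
  );
}

// Usage: a separate sweeper or queue still has to redrive executions that
// crashed between steps; that part is not shown here.
await runStep("order-42", "charge-card", async () => {
  /* call the payment provider */
});
await runStep("order-42", "send-receipt", async () => {
  /* send the email */
});
```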
The beauty of a complete purpose-built system like Restate is that it gives you a durable journal service that's designed for the task of tracking executions, and also provides you with an SDK that makes it very easy to achieve the "chain of idempotent blocks" effect without hand-rolling a giant state machine yourself.
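For comparison, a minimal sketch of that same flow with the Restate TypeScript SDK (service and step names are made up; check the SDK docs for the exact signatures in your version). Each ctx.run block is recorded in the journal, so on retry completed blocks are replayed instead of re-executed.

```ts
import * as restate from "@restatedev/restate-sdk";

const orders = restate.service({
  name: "orders",
  handlers: {
    process: async (ctx: restate.Context, order: { id: string }) => {
      // Each block below is journaled; on retry, completed blocks are
      // replayed from the journal rather than executed again.
      await ctx.run("charge-card", async () => {
        /* call the payment provider */
      });
      await ctx.run("send-receipt", async () => {
        /* send the email */
      });
      return { status: "done", id: order.id };
    },
  },
});

restate.endpoint().bind(orders).listen(9080);
```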
You don't have to use Restate to persist data, though you can - and then you get the benefit of state changes committing automatically, with the same isolation properties, as part of the journaling process. But you can just as easily orchestrate writes to external stores such as an RDBMS, a key-value store, or a queue with the same guaranteed-progress semantics as the rest of your Restate service. The execution semantics make this easier and more pleasant, since you get retries out of the box.
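If you do let Restate keep the state, here is a hedged sketch of a keyed Virtual Object (again, the names are illustrative): reads and writes of the keyed state go through the same journal as the handler's other steps.

```ts
import * as restate from "@restatedev/restate-sdk";

// Hypothetical Virtual Object keeping a per-key counter in Restate itself.
const counter = restate.object({
  name: "counter",
  handlers: {
    add: async (ctx: restate.ObjectContext, amount: number) => {
      // Keyed state is journaled with the invocation, so it commits
      // together with the rest of the handler's progress.
      const current = (await ctx.get<number>("count")) ?? 0;
      ctx.set("count", current + amount);
      return current + amount;
    },
  },
});
// Bind to an endpoint as in the previous sketch.
```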
Finally, it's worth mentioning that we expose a PostgreSQL protocol-compatible SQL query endpoint. This lets you query any state you do choose to store in Restate, alongside service metadata - e.g. to introspect active invocations.
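For example, something along these lines; the port number and the sys_invocation table and column names are assumptions from memory, so check the introspection docs for your version. Any Postgres client works, node-postgres here:

```ts
import { Client } from "pg";

// Assumption: the Restate server exposes its psql-compatible endpoint on 9071.
const introspection = new Client({ host: "localhost", port: 9071 });
await introspection.connect();

// Assumption: an introspection table named sys_invocation with these columns.
const res = await introspection.query(
  "SELECT id, target, status FROM sys_invocation LIMIT 10"
);
console.log(res.rows);
```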
I think you handle it just like compatibility between services: you never remove parameters, and you only ever add new optional ones if you have to. That way a message from the past stays compatible with a future handler, just as when a caller that depends on you is using an outdated client/API definition.
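A tiny hypothetical example of that rule in plain TypeScript; the v2 handler still accepts every message a v1 caller could have produced:

```ts
// v1 request shape: { userId }
// v2 adds an optional flag instead of removing or renaming anything.
interface NotifyRequest {
  userId: string;
  quiet?: boolean; // new in v2; absent in messages sent by v1 callers
}

function handleNotify(req: NotifyRequest) {
  const quiet = req.quiet ?? false; // defaulting preserves v1 behaviour
  // ...
}
```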
But you are right: it's very hard to reason about testing such systems, since you may have accumulated state that causes your handler logic to behave differently. The problem exists in service architectures in general, though; it's just very hard to miss when processing is intentionally delayed.