
It works well if the company empowers engineers to write software that doesn't break all the time.

At HBO Max every incident had a full writeup and then real solutions were put in place to make the service more stable.

My team had around 3 incidents in 2 years.

If the cultural expectation is that the on-call buzzer will never go off, and that it going off is a Bad Thing, then on call itself isn't a problem.

Or as I was fond of saying "my number one design criteria (for software) is that everyone gets to sleep through the night."

The customers win (stable service) and the engineers win (sleep).



What was the time to implementation for the real solutions?

Were there any cost considerations associated with prioritizing the implementation, or even with limiting the scope of the solution?


The tooling there was amazing, so a barebones service could get deployed into prod in a couple of days if need be.

That typically didn't happen because engineering reviews had to occur first.

A single command created a new repo, set up ingress/egress configs in AWS, and set up all the boilerplate to handle secrets management, environment configs, and the like.

All that was left to do was business logic.
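To make that concrete, here is a minimal sketch of what such a one-command scaffolder could look like. This is purely illustrative and does not reflect HBO Max's actual tooling: the file layout, the `scaffold` function, and the provisioning stub are all assumptions, with the AWS step reduced to a placeholder.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a one-command service scaffolder (not the real tool)."""
import pathlib
import subprocess

# Assumed boilerplate: environment configs, a secrets note, and a business-logic stub.
BOILERPLATE_FILES = {
    "config/dev.yaml": "log_level: debug\n",
    "config/prod.yaml": "log_level: info\n",
    "secrets/README.md": "Secrets are injected at deploy time; never commit them.\n",
    "src/handler.py": "def handle(event):\n    # business logic goes here\n    raise NotImplementedError\n",
}


def scaffold(service_name: str) -> None:
    root = pathlib.Path(service_name)
    root.mkdir()

    # 1. Lay down the repo skeleton with configs and stubs.
    for rel_path, contents in BOILERPLATE_FILES.items():
        path = root / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(contents)

    # 2. Initialize version control so the repo exists from minute one.
    subprocess.run(["git", "init", str(root)], check=True)

    # 3. A real tool would call AWS APIs (or Terraform/CloudFormation) here to
    #    provision ingress/egress and secrets access; this is only a placeholder.
    print(f"[stub] would provision ingress/egress and secrets access for {service_name}")


if __name__ == "__main__":
    scaffold("example-service")
```

The point of the sketch is the shape of the workflow: after running the command, the only file an engineer needs to touch is the business-logic stub.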


It sounds like when there are production issues, the priority is to put out an actual fix immediately.


It depends on the severity of the issue.

If the issue impacts tens of millions of customers, then yes, get it fixed right now. Extended outages can be front page news. Too many in a row and people leave the service.

Ideally, monitoring catches outages as they start, and runbooks have steps to quickly restore service even if a full fix cannot be put in place immediately.
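As a rough illustration of catching an outage early, here is a minimal health-check loop. The endpoint, the failure threshold, and the paging hook are all assumptions for the sketch; in practice this logic lives in the monitoring/alerting system rather than in a hand-rolled script.

```python
"""Minimal sketch of a health check that pages on-call before customers notice."""
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://example.com/healthz"   # hypothetical health endpoint
CONSECUTIVE_FAILURES_TO_PAGE = 3             # avoid paging on a single blip


def check_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def page_on_call(message: str) -> None:
    # Placeholder: a real setup would call PagerDuty/Opsgenie or similar.
    print(f"PAGE: {message}")


def monitor_loop() -> None:
    failures = 0
    while True:
        if check_once(HEALTH_URL):
            failures = 0
        else:
            failures += 1
            if failures >= CONSECUTIVE_FAILURES_TO_PAGE:
                page_on_call(f"{HEALTH_URL} failing health checks; follow the runbook.")
        time.sleep(30)  # poll every 30 seconds


if __name__ == "__main__":
    monitor_loop()
```

The page points at the runbook so the responder can restore service quickly, even when the underlying fix has to wait for a proper review.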




