Hacker News

In the AWS builders library they suggest setting the timeout to be at the p99 for the expected latency of the operation (or choose a different percentile if you want to be more or less tolerant of false positives). That methodology seems pretty solid, provided it's something that's continually re-evaluated and tested under load.
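As a rough sketch of that methodology, here's one way to derive a timeout from observed latency samples at a chosen percentile using only the Python standard library. The function name and sample data are illustrative, not from the AWS guidance itself:

```python
import statistics

def timeout_at_percentile(latencies_ms, pct=99):
    """Return the latency at the given percentile of the samples."""
    # statistics.quantiles with n=100 yields the 1st..99th percentile
    # cut points; index pct-1 is the requested percentile.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return cuts[pct - 1]

# e.g. timeout_at_percentile(observed_latencies, pct=99)
```

In practice you'd feed this live measurements and re-evaluate it continually, as the article suggests.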

Also important to consider what the client is advised to do in the case of a timeout. Retries, for instance, should likely have backoff and jitter attached, or a retry budget.
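A minimal sketch of that retry pattern, using exponential backoff with "full jitter" and a cap on attempts. The function name and default limits are my own invention, not any particular client library's API:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base=0.1, cap=5.0):
    """Call `operation`; on failure, sleep base*2^attempt (capped) with full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted, surface the error
            # Full jitter: sleep a random amount up to the capped backoff,
            # so retries from many clients don't arrive in synchronized waves.
            backoff = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```

A retry budget (e.g. retries capped at some fraction of total request volume) goes further than a per-call attempt limit, but the per-call cap above is the simplest version.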



That number sounds like really bad advice to me. Should be more like p99.99 in my experience.

Internal services have extremely low response times during normal operation (p99 around a second), but then the database starts a snapshot, or a large analytics query hits on the weekend (high I/O), and latency goes through the roof for a short while. Too bad if services have short timeouts: now they're failing all requests for no reason.

p99 reflects normal operation. Services shouldn't be configured to systematically fail 1% of operations.
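A quick sketch of the failure mode described above: a timeout set at the p99 of healthy traffic rejects essentially every request during a short I/O spike. All the numbers here are invented for illustration:

```python
import statistics

# Healthy traffic: ~100-150 ms (made-up values).
normal = [100 + i % 50 for i in range(1000)]
# A short snapshot/analytics window: latency ~10x higher.
spike = [1500 + i % 200 for i in range(100)]

# Timeout chosen at the p99 of normal operation.
timeout = statistics.quantiles(normal, n=100)[98]

# Every request in the spike window exceeds that timeout.
failed = sum(latency > timeout for latency in spike)
```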


Fair enough; that's why they call out that you need to load test it and actually verify that the value you set meets expectations. Agreed that blindly setting a value is problematic.




