
I never got this approach, for three reasons. The first: sure, you could say that some "cruft" accumulates. But by rebooting you're guaranteeing that if something goes only slightly wrong, you may not notice it; every month you start with a clean system that no longer shows the issue. So you're choosing to potentially ignore tiny issues instead of letting them crash the system in a visible way and fixing them properly, once and for all.

The second is that there shouldn't be any "cruft". Servers are not running Win95, which reliably crashed given enough uptime. "Cruft" should be fixable - if it isn't, then you're running a system that cannot really be supported.

The third one is that if you cannot say anything more specific than "cruft", then your system is badly managed. Are you restarting because your app leaks memory? Is it leaving zombie processes? Is it leaving dead connections to the database? Or maybe something else entirely? Restarting can be a short-term solution for some specific issue, but if it's there to remove "cruft" and "you never know" what it is, then you might as well try arranging your server room according to feng shui or using voodoo healing to make your app run better. Either you control your system, or you don't.



> So you're choosing potentially ignoring tiny issues instead of letting them crash the system in a visible way and fixing them properly forever.

This makes the assumption that I am able to fix them. Few people realize it, but you have very little control over the system. You can easily fix configuration errors, but if the error is due to a fundamental defect in the source code of a critical package, there's nothing you can do. I do not have the time to write patches for every defect in the system, nor do I have the luxury to tolerate them. Quite a conundrum, isn't it?

> The second is that there shouldn't be any "cruft".

There's always cruft. If you plot the uptime of servers, you will find that it looks like a very steep bell curve. Very few servers run for only a few days, but also, very few run for years at a time. Most run for a month or two, or three.

In practical terms, this means that the bugs that get fixed first are the ones that crop up immediately on boot (everyone experiences them). The bugs that get fixed next are the ones that crop up for the average user or the middle of the bell curve (server running for a month or two). The bugs that get fixed last, if at all, are the ones that crop up for the fewest users (server running for years). This article is an example of that.

So by running your server for years at a time, you are exposing yourself to a greater number of unfixed bugs. Also, memory leaks get worse over time, and even a really minor one can mean serious issues over the span of years.
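To put a rough number on that last point, here is a back-of-the-envelope sketch; the 4 KB/hour leak rate is purely illustrative, not a figure from the thread:

```shell
# Hypothetical daemon leaking 4 KB per hour (illustrative rate)
leak_kb_per_hour=4
hours=$((24 * 365 * 3))                    # three years of continuous uptime
total_kb=$((leak_kb_per_hour * hours))     # total memory leaked
echo "${total_kb} KB (~$((total_kb / 1024)) MB) leaked over 3 years"
```

Even a leak far too small to notice in a month of uptime adds up to ~100 MB over a few years of continuous running.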

> The third one is that if you cannot say anything more specific than "cruft", then your system is badly managed.

I am not restarting because there is anything wrong with the system. It is not a solution to anything. It is a preventative measure. Preventative maintenance.

First, if there is a power failure or some other problem that forces the server to reboot unexpectedly, I know it will come back up. I know that because I have designed the system to reboot regularly and I test that capability on every server, after every update.

Second, I am avoiding problems that crop up for systems that are operating outside of the average uptime.

Third, my servers boot within 1 minute at most. What's my cost for doing this? 1 minute of uptime in the middle of the night once per month? So be it.
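As a sketch of that kind of preventative schedule, a monthly off-hours reboot can be driven by a single cron entry; the day, time, and file path here are placeholders, not something specified in the thread:

```shell
# /etc/cron.d/monthly-reboot (illustrative; pick a window that suits you)
# Reboot at 03:30 on the first day of every month
30 3 1 * * root /sbin/shutdown -r now "scheduled monthly reboot"
```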


> Few people realize it, but you have very little control over the system.

That might be the difference in our POVs... Most of the time I work in environments where we do have control over the whole system, or at least aim for it.

> If you plot the uptime for servers, you will find that it looks like a very steep bell curve.

Unless they kernel-panicked, the systems I took care of ran from one kernel update to the next. I never experienced the "cruft" in any way.

> I am not restarting because there is anything wrong with the system. It is not a solution to anything. It is a preventative measure.

So you don't know of anything going wrong. You're not fixing anything by restarting. You're restarting just in case it prevents something from breaking. I guess I just disagree with that reasoning.


How do you know you can restore your system to a working state in the event of an unscheduled outage, cruft or not?

You should discriminate between services and systems - make your service available 100% of the time, but you should be able to kill and restart/reload/replace systems for maintenance or other reasons at nearly any time. And you SHOULD do that, because without proof that you can do it, your DR solution is simply a best guess.


By enforcing configuration management software (Puppet, CFEngine), so that one-off fixes don't get hot-patched on the production server and never documented.


This goes a very long way to helping, yes, but by itself does not guarantee anything. And not everything can be cfengined or puppeted.


You can test for that in a staging environment. Restarting live services doesn't help you guarantee anything, but it makes it more likely that you restart some server in a specific situation you can't recover from. (You'll never guarantee that you can recover from every situation.)


Rebooting every 30 days translates into 120 reboots over the course of 10 years.

In 10 years your application is likely to have been rewritten anyway (with old bugs removed along with the old code, and new bugs introduced).

It may be cheaper to do 120 reboots than to debug and fix unknown cruft.
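The total downtime cost of that schedule is easy to bound, taking the one-minute-per-boot figure from the comment above:

```shell
reboots=$((12 * 10))            # one reboot per month for 10 years
minutes_down=$((reboots * 1))   # at most 1 minute of downtime per boot
echo "${reboots} reboots, ${minutes_down} minutes of downtime over 10 years"
```

Two hours of scheduled downtime spread over a decade, versus an open-ended debugging effort.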



