
No downtime is acceptable, but they have only one server?

What if a technical failure happens? What if there's a fire in the server room? What if there is an earthquake and the building collapses? What if... many things can happen that can result in a long, long downtime with this approach.

If uptime is so crucial, the system should be set up in such a way that moving one server is a piece of cake, not a spec-ops mission.



> Should have been a 5 minute job if done correctly. Owner ended up paying for over 10 hours of work. Stupidest thing I've ever had to do.

You can see the common sense ship has sailed.


You’d be shocked how rare downtime is with modern hardware. A redundant power supply and SSDs in the right RAID configuration typically will not have any issues for years until it can be replaced by a newer model. Also, hardware monitoring is significantly improved to the point where you’ll typically know if something will fail and can schedule the maintenance.

In the past power supplies and spinning disc hard drives would fail much more often.

It’s basically a solved problem, outside of extremely mission critical, 5 nines kind of stuff, that we all forgot because of AWS.

HN ran, and may still run, on a single bare metal server.


> HN ran, and may still run, on a single bare metal server.

I bet HN wouldn't do a 10-hour, high-risk operation to move their servers because they can't afford an outage. (But well, running stuff on a single bare-metal server is expensive enough that even if they could, I expect they don't.)

What would that company do if a pipe broke inside the datacenter? Besides, if you never restart your servers, you're guaranteeing that the one time the power goes out across the entire city, they won't come back online.


> I bet HN wouldn't do a 10 hours high-risk operation for moving their servers because they can't afford an outage.

HN is probably not business-critical and could probably afford a 10-hour downtime without much hassle.


The point is that they probably also wouldn't then insist on a consultant doing an unreasonable migration and threatening to not pay them if there was downtime. And they probably wouldn't call around to other consultants with the same requirements, apparently telling them that the first consultant refused to do the job.


> apparently telling them that the first consultant refused to do the job.

While I don't think they informed them of this in good faith, it is a nice heads-up. In this case, it meant Consultant2 consulting RefusingConsultant, who probably knew the IT setup better.


It would be legitimately interesting if a 10-hour downtime of HN correlated at all with an increase in GitHub commits.

I hope there wouldn't be a correlation, but I wouldn't be all that surprised if a somewhat loose one was found.


Quality hardware has existed for years. At a Ford motor plant, they were doing an inventory and couldn't locate a 10-ton mainframe. It had been working so well for 15 or so years that the tribal knowledge of where it was physically located was lost.


Wow, that's impressive, losing that big a piece of hardware.

Though it was likely easier to find than that Novell Netware server that was sealed behind some drywall, with only a stray network cable leaving any clue as to where it was.


Depends on how big the building is that houses it – manufacturing IT can deal with impressive floor spaces.

I once only half jokingly suggested finding a missing data closet in a two million square foot distribution center by pinging a known IP from three or four aggregator switches across the building and triangulating the location on a floor plan. Sadly the people crawling around the ceiling found it before I could put my idea into practice.


2M sqft is c. 430m x 430m for a square floorplan. Ping resolution is 1us (microsecond). Speed of an electrical signal in copper is about 0.8c. That gives a max resolution of ~240m by my reckoning. If there are variances in the switch+network delay, it seems like you're going to struggle to even say which side of the building it is on.

Good job they found it!
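
For the curious, that back-of-the-envelope reckoning in code form (just a sketch; the figures are the rough ones from the comment above):

  # Rough resolution estimate for ping-based triangulation (approximate figures).
  C = 299_792_458                 # speed of light in vacuum, m/s
  V = 0.8 * C                     # signal speed in copper, roughly 0.8c
  TICK = 1e-6                     # ping timestamp resolution, ~1 microsecond

  side_m = (2_000_000 * 0.0929) ** 0.5   # 2M sqft as a square floorplan, in metres
  resolution_m = V * TICK                # distance covered per timer tick

  print(f"building side   ~{side_m:.0f} m")        # ~430 m
  print(f"ping resolution ~{resolution_m:.0f} m")  # ~240 m per microsecond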


Hah! Good math. Based on the switch placement and the building being more of a rectangle I figured "north side or south side" would be as close as I could get. And when we really dug in it was a classic last mile problem: the first several core switches were well known, we just needed to figure out where the last aggregate switch went.

Turns out a door had been closed off and a new one built from one hallway to another, and it wasn't properly labeled on the updated drawings. Had one of the boxes running a conveyor belt not died, we'd never have looked.


This is all true, but you still can't rely on increased hardware quality if you can't afford any downtime from moving a server (a one-time event).

Also, that doesn't cover other problems mentioned here, like natural disasters, ISP problems, etc.


Often these kinds of SLAs are decided upon based on blame rather than what is reasonably required by the customers of that system. In this case, moving offices means the downtime is due to internal reasons. But if an ISP goes down or there is a natural disaster, then that isn't in their control.

Also, cost comes into play as well. Multiple physical links in would be very expensive for what sounds like internal services. Likewise, a natural disaster might cause bigger issues for the company than those internal services going down. They might still have off-site backups (I'd hope they would!), so at least they can recover the services, but the cost of having a live redundant system off site might not justify those risk factors.

The customer's requirements are definitely unreasonable though. I'd hope those systems are regularly patched, in which case, when is downtime for that scheduled, and why is that acceptable but not when you're physically moving the server? It doesn't really make much sense; but then, “not making much sense” is also quite a common problem when providing IT services for others.


You are right, their SLA can be a bit different from what we're talking about here (and expect).

In general, we don't know much about this case. It's a post on Reddit, might not even be true. As is, it doesn't make much sense, but we don't know all the details, so maybe we jumped to conclusions.


> can't rely on increased hardware quality if you can't afford any downtime due to moving (a one-time event) a server.

A mainframe is not just a server. You can hot-plug RAM on these things.


Still, sooner or later, the data center will be hit by a natural disaster, a DoS attack, a network problem, or the like, and you'll have to be ready to move to a different one to get your service back online. Or you'll have to reboot your server to apply a critical kernel security update, in which case you need to be ready to fail over to a hot standby. So, since relying on a single server with high-uptime hardware is penny-wise and pound-foolish, you might as well go with a cloud-style architecture with commodity hardware.


I used to be fascinated with datacenters and would masquerade as a prospective customer to get a tour and see all the cool gear. I was asking one engineer what their plan was for a tornado (this was at ThePlanet in Dallas, TX way back when) and they basically scoffed at the question. A week or so later one briefly touched down about 1/4 mile from them; I wonder if they thought about me when the sirens were going off, hah.


Even in modern hardware there are plenty of single points of failure.

Single server and "can't tolerate any downtime" are mutually exclusive.


AWS and older hardware are no different. Set it up once and it keeps running for many years.

I've come across an old AWS account (the startup has been using AWS for the longest time). All the network traffic for the VPN goes through a single instance with 3 years of uptime.


AWS EC2 instances or their host machines can fail at any time and it’s out of your hands.


True fact! I recently had EC2 migrate my VM when the physical server it was on reached EOL. If they had fired my VM up again, I wouldn't have even noticed. They didn't. Fortunately it had an EBS volume and I was able to manually restart it without data loss.


Physical servers can fail at any time and it's out of your hands. ;)


Human error is a bigger cause of downtime than technical failure or natural disasters. And in practice, a single server like this tends to be a hand managed one-off which only exasperates the human error component.


s/exasperates/exacerbates/


It's probably a bit of both, TBH. ;)


Unfortunately, complacency about how reliable modern hardware is can lead to neglecting things like off-site backups, among other issues. Yeah, your one big critical on-premises server may be super reliable. But what happens when the building is flooded with 6 ft of water, catches on fire, is leveled in an earthquake, or anything else?

If a function is super critical to business, it also deserves to have some thought put into the blast radius of its failure.

The sort of places that would insist on rolling a live server 700 ft across a parking lot probably don't have any real disaster recovery plan.


>hardware monitoring is significantly improved to the point where you’ll typically know if something will fail and can schedule the maintenance.

There's SMART for disks... what else?


And multiple power supplies. I have been running a single physical server like this for ~10 years, and the only downtimes were me restarting to boot a new kernel and when people at the datacenter messed up BGP routing (their fault). HW is really very reliable now, especially in a datacenter environment. But still not 100%, of course. There is still a probability of it failing, though it's lower than most think. IC chips most likely won't break; only some capacitors degrade over time, and the flash memory holding the BIOS is normally only guaranteed for 10 years. A BIOS upgrade (a new write) would prolong that, though. I had one disk fail in the RAID. Changed the drive without any downtime.
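
To make the monitoring part concrete, here's a minimal sketch of the kind of checks you could script on a box like that (assuming Linux with smartmontools and mdraid; the device name is just a placeholder):

  #!/usr/bin/env python3
  # Minimal hardware health-check sketch for a single Linux server.
  # Assumes smartmontools is installed and software RAID via mdraid.
  import subprocess

  def smart_ok(device):
      # `smartctl -H` prints an overall-health assessment for the drive.
      out = subprocess.run(["smartctl", "-H", device],
                           capture_output=True, text=True).stdout
      return "PASSED" in out or "OK" in out

  def raid_degraded():
      # /proc/mdstat marks missing array members with an underscore, e.g. [U_].
      with open("/proc/mdstat") as f:
          return "_" in f.read()

  if not smart_ok("/dev/sda"):          # placeholder device name
      print("WARN: /dev/sda SMART health check failed")
  if raid_degraded():
      print("WARN: an md array is running degraded")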


ECC for RAM is the other big one. A single-bit error will trigger warnings, so that you can replace the faulty DIMM before it progresses into uncorrectable errors.
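
On Linux, the kernel's EDAC subsystem exposes those counters in sysfs, so you can watch for a DIMM that's starting to go bad. A rough sketch (this is the common sysfs layout, but treat the exact paths as an assumption for your hardware):

  #!/usr/bin/env python3
  # Sketch: read ECC error counters from the Linux EDAC sysfs interface.
  # ce_count = corrected errors, ue_count = uncorrected errors.
  import glob, pathlib

  for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
      ce = int(pathlib.Path(mc, "ce_count").read_text())
      ue = int(pathlib.Path(mc, "ue_count").read_text())
      if ce or ue:
          print(f"{mc}: {ce} corrected, {ue} uncorrected ECC errors")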


Is there a tool that can randomly take 128 MB chunks of memory out of the pool and test them around the clock?


>HN ran, and may still run, on a single bare metal server.

HN also has downtime fairly often.


Yeah, that's how you end up with 3 years of uptime on some forgotten servers... :)


Which is why AWS instances should be no more than minions in a load balancer pool, with any permanent state on an EBS volume or a managed storage service.


What's the current advice on SSD RAIDs?


From an ISP perspective, this seems like the sort of company that orders one $250-a-month business DIA circuit (at a price point where there is no ISP ROI for building a true ring topology to feed a stub customer) and has no backup circuit. Then the inevitable happens, like a dump truck 2 km away driving through aerial fiber with its bed raised and causing an 18-hour outage.

Some circuits might average 5 to 7 nines of uptime over a year, but the next year is dump truck time... You can never truly be certain.


At my last job I worked for a place with a single rack-mounted set of Windows servers at a data center - with no backup power supply, no backups of any kind for that matter, no UPS, and no redundancy of any system; plus they didn't even have an admin for 6 months. The CEO refused to spend money on a second anything. The company has 2000 employees. One server held all of the company's photos (which is basically the core of the business) and of course was not backed up.


This is the kind of company that could benefit immensely from a ransomware attack.


My boss refused to use UPSs for years because he bought one once and couldn't get it to stop beeping.


Of course it can work, you can get far with one server and no spending on anything like backups, UPS, etc.

Whether it's smart and good for your business/reputation is a different question.


You wrote one server, but you describe the failure modes of having one data center. I think it is very, very uncommon and hard to allow for a data-center-level issue. After all, Instagram and 100 other sites failed when one AWS data center went down. I would be interested to know how/whether anyone's backend would keep working if a data center and its databases completely failed due to fire/earthquake/networking, etc.

The second thing is having multiple machines for a service. In theory it might help increase availability, but in practice I haven't seen random machine issues that occur just based on probability. I think almost all failure modes that exist are correlated between machines. E.g., suppose you have data loss on one machine: more likely than not, you could blame it on code, and it would be similar across machines.


Re: single datacenter. At a basic level, you need a second datacenter with enough machines to provide your service (or at least an emergency version of it), replication of data, and a way to switch traffic. It's doable, but expensive in capital and development. If you're dependent on outsourced services, they also need to be available from both datacenters and not served from only one. In an ideal world, your two datacenters would be managed by different companies, so you would avoid any one company's global routing failure (IBM had one recently).

Re: multiple servers. Power supplies fail, memory modules fail, CPUs fail, fans fail, storage drives fail. Sometimes those failures are correlated --- the HP SSDs that failed when their power-on hours hit a limit (two separate models) are going to be pretty correlated if they were purchased new, stuck into servers at a similar time, and then run 24/7. Most of those failures aren't that correlated, though. Software failures would be more likely to be correlated, of course.

The key thing is to really think about what the cost for being down is, how long is acceptable/desirable to be down, and how much you're willing to spend to hit those goals.


> In an ideal world, your two datacenters would be managed by different companies, so you would avoid any one company's global routing failure

I can't understand this. I think transferring servers would be the least of the problems. It's the transferring of the database and maintaining consistent versions of the database in both locations. Moving snapshots every X minutes doesn't maintain consistency. I would like to read about any company that is able to do this, as honestly it sounds really hard to me. Is there any writeup of the IBM thing you mentioned?


Re: IBM outage

https://news.ycombinator.com/item?id=23471698

TLDR is that connectivity to and from the IBM cloud datacenters (which include SoftLayer) was generally unavailable, globally, for a couple of hours. If you were in multiple IBM datacenters, you were as down as if you were in only one (mostly; I was poking around when it was wrapping up, and some datacenters came back earlier than others).

> It's the transferring of the database and maintaining consistent versions of the database in both locations. Moving snapshots every X minutes doesn't maintain consistency. I would like to read about any company that is able to do this, as honestly it sounds really hard to me

The gold standard here is two-phase commit. Of course, that subjects every transaction to delay, so people tend not to do that. The close enough version is MySQL (or other DB) replication, monitor that the replication stream is pretty current and hope not a lot is lost when a datacenter dies. There's room to fiddle with failover and reconciliation; I recommend against automatic failover for writes, because it gets really messy if you get a split brain situation --- some of your hosts see one write server available and others see another, and you may accept conflicting writes. A few minutes running like that can mean days or weeks of reconciliation, if you didn't build for reconciliation.
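
For the "monitor that the replication stream is pretty current" part, here is a minimal sketch of what such a check might look like against a MySQL replica (host and credentials are placeholders; classic replication reports lag as Seconds_Behind_Master in SHOW SLAVE STATUS):

  #!/usr/bin/env python3
  # Sketch: warn if a MySQL replica falls too far behind the primary.
  # Connection details below are placeholders, not a real setup.
  import pymysql

  MAX_LAG_SECONDS = 30

  conn = pymysql.connect(host="replica.example.internal", user="monitor",
                         password="...", cursorclass=pymysql.cursors.DictCursor)
  with conn.cursor() as cur:
      cur.execute("SHOW SLAVE STATUS")
      status = cur.fetchone()

  lag = status["Seconds_Behind_Master"] if status else None
  if lag is None:
      print("WARN: replication is not running (or this host is not a replica)")
  elif lag > MAX_LAG_SECONDS:
      print(f"WARN: replica is {lag}s behind the primary")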


He should have taken it offline without notifying this brain-dead manager. Probably wouldn't have noticed lol.

And then charge for those 5 hours for good measure.

In general, this stupid trend of wanting zero downtime makes no sense to me. If you're not NASA, the police, or another emergency service, you 100% can afford a few hours of downtime by scheduling it beforehand.


We used to have one server for a website I was a content guy on - it was in a standard PC case, plugged into a switch in the IT team's office (this was not a tech-centered org).

The main IT guy went on holiday and one of the cover guys from another office decided to tidy up. He unplugged the server and thought (and told me his thought process afterwards) "if anyone was using it, they'll let us know".

This was the one, single box for the whole website - no one else was monitoring it (even though the central office had a proper, dedicated web team) and the assumption was that I was the sysadmin.

An hour later I'm sprinting down the corridor to find out what the hell happened and why I can't even SSH into the box.

We put a sticker on the case saying not to unplug it after that...


Reminds me of how IBM positions mainframes: they are so highly available that you simply never shut them down.


IBM mainframes are designed to be serviced while running, so if you have multiple CPUs you can take one offline at a time to upgrade it without the whole mainframe going down. Big Sun Solaris boxes were built like that as well.

If your mainframe had only one CPU, you did have to turn it off in order to service it. But you could upgrade the OS without turning it off. While they aren't cool tech now, mainframes are a marvel of hardware engineering.


Plus, I would imagine turning them on and bringing them online isn't just a press of a button.


It's not. https://web.archive.org/web/20190324191654/https://www.ibm.c...

(archive.org link because ibm.com apparently isn't hosted on a mainframe.)


Never mind these less common scenarios... What do they do about Windows updates?


or even better, how do they apply OS patches?



