It's nice to have a lot of options when managing deployments: feature flags, canary deploys, A/B or blue/green, etc. Canaries are really nice for catching the majority of deploy issues (e.g. 'it doesn't start up' errors, obvious exception spikes or performance regressions). Feature flags let you hand off control to individual product teams and encourage them to create 'bite sized' changes which can be flagged. Blue/green is also a nice way to reduce risk for more complicated cross-service changes, where having another 'known good' copy of production around is helpful.
Kudos to the Reddit engineering team. I've been really enjoying their quality posts as of late, and I wish more companies of that scale were as transparent with their engineering problems and solutions. Thank you.
Great write-up. I liked the stage-by-stage structure of this.
I built a tool for the team to coordinate deploy locks at my last job, similar to the IRC bot described here. People seemed to like it a lot over the previous system of just shouting at the rest of the room that they're taking prod.
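For anyone curious what a deploy lock boils down to, here is a minimal in-memory sketch. It is not the tool described above or Reddit's IRC bot; the `DeployLock` class, its method names, and the expiry behaviour are all assumptions made for illustration. A real version would keep the state somewhere shared (a database, a chat bot's memory) rather than in one process.

```python
import time
from typing import Optional


class DeployLock:
    """Hypothetical single-holder deploy lock with an expiry (illustration only)."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self.ttl_seconds = ttl_seconds
        self.holder: Optional[str] = None
        self.acquired_at = 0.0

    def _expired(self, now: float) -> bool:
        # A lock older than the TTL is treated as abandoned.
        return self.holder is not None and now - self.acquired_at >= self.ttl_seconds

    def acquire(self, user: str, now: Optional[float] = None) -> bool:
        """Take the lock if it is free, expired, or already held by `user`."""
        now = time.monotonic() if now is None else now
        if self.holder is not None and self.holder != user and not self._expired(now):
            return False
        self.holder = user
        self.acquired_at = now
        return True

    def release(self, user: str) -> bool:
        """Only the current holder may release the lock."""
        if self.holder != user:
            return False
        self.holder = None
        return True
```

The expiry is there so a forgotten lock does not block everyone forever; whether that is desirable depends on the team.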
I think it's strange how little attention this got here on HN. Submitted by one of the Reddit admins even. I'd ask you to do an AMA but I have no questions and besides you lot are pretty good at answering questions whenever they come up anyway so.
How do you handle one-way gates? Clearly at each deploy there are two different versions running concurrently, and as you make changes you do so knowing that, but as the system evolves there are points in (code) time that you can't go back to. Is this not a concern because you would never roll back that far?
Yeah, rollbacks are more of adding a revert to the top of the pile so we just make sure we roll back things that can be rolled back. This is important to think about when planning deploys.
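A rough sketch of the "revert on top of the pile" idea, combined with the one-way-gate concern from the question. This is only an illustration: the `Change` and `History` names and the `reversible` flag are hypothetical and do not come from Reddit's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Change:
    name: str
    reversible: bool = True  # assumption: one-way gates are marked up front


class History:
    """Append-only list of deployed changes (illustration only)."""

    def __init__(self) -> None:
        self.changes: List[Change] = []

    def deploy(self, change: Change) -> None:
        self.changes.append(change)

    def roll_back(self, name: str) -> Change:
        """Roll back by appending a revert; earlier history is never rewritten."""
        target = next((c for c in reversed(self.changes) if c.name == name), None)
        if target is None:
            raise KeyError(name)
        if not target.reversible:
            raise ValueError(f"{name} is a one-way change and cannot be reverted")
        revert = Change(f"Revert {name}")
        self.changes.append(revert)
        return revert
```

The point is that history only moves forward: a rollback is just another deploy, and anything that cannot be undone has to be identified when the change is planned.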
> Yeah, rollbacks are more of adding a revert to the top of the pile
Neat. It reminds me of the method used in gaming nowadays, which is to just write a "savepoint" marker into the message stream instead of pausing the entire game to save state.
When you deploy: Will you lose requests that are currently being processed while you restart the service? Also, will a server still receive requests while the new code is being started?
Generally: no. The worker processes will finish up their current request before shutting down and being reaped. At the whole server level, Einhorn will indeed make sure that requests are still being served as the workers get shuffled out.
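The "finish the current request, then exit" behaviour can be sketched roughly as below. This is not Einhorn's API or Reddit's worker code; `GracefulWorker` and its methods are hypothetical names, and the use of SIGTERM as the shutdown signal is an assumption.

```python
import signal
from collections import deque
from typing import Callable, Deque, List


class GracefulWorker:
    """Hypothetical worker that drains its in-flight request before stopping."""

    def __init__(self, handle: Callable[[str], str]) -> None:
        self.handle = handle
        self.shutting_down = False
        self.completed: List[str] = []

    def request_shutdown(self, signum: int = 0, frame: object = None) -> None:
        # Only set a flag; the request being handled is never interrupted.
        self.shutting_down = True

    def install(self) -> None:
        # Assumption: the process manager asks workers to stop with SIGTERM.
        signal.signal(signal.SIGTERM, self.request_shutdown)

    def run(self, queue: Deque[str]) -> None:
        # The flag is checked between requests, so the current one always finishes.
        while queue and not self.shutting_down:
            request = queue.popleft()
            self.completed.append(self.handle(request))
```

Requests left in the queue stay there for the replacement workers to pick up, which is why nothing is dropped while processes are being shuffled out.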
Actually, I did think of something, though it's more of a Reddit question than an AMA kind of question. I opted in to the profile beta literally less than an hour ago and I like it, but I have a bug report [1] and a feature request / feedback [2] that I'd like to be seen by the specific people who work on the profile portion of Reddit. Who do I contact about that? Or was submitting it to /r/beta sufficient, and they'll see it because I've posted it there?
> Do you find yourself browsing Reddit in your spare time or do you avoid it?
I personally still spend plenty of time reading stuff on the site, both educational (r/askhistorians, various programming subreddits) and entertainment (r/popular). I'll definitely tend to avoid the more meta subreddits when not in a work mood though.
> Do you ever find yourself spending whole days at work without being on the site because you don't want to be distracted while working on something?
Oh yeah, it definitely happens: heads down in code, doing a complex series of deploys, or just in a bunch of meetings.
There's also the times where I go to check on the site to check on something I deployed and instead get distracted by something on the front page and forget what I was doing. Oops.
I'd like to pose a follow up question; as your role at the company has progressed, do you feel your day to day (collective you, the [a]'s) is less development oriented and more social management of the different issues that arise when a community gets that large?
Actually quite the opposite. One of the benefits of being a larger company is separation of concerns. Our community team is a bunch of fantastic people that are focused on that side of things, and engineering spends most of its time writing code. Obviously there will be overlap from time to time, but it's definitely way less than when we were 10 people.
Does Reddit have democratic deploys (i.e. every engineer is responsible for deploying their own changes into production)? Or do you guys have a system with release managers?
If you're looking to build a deployment tool from scratch, please consider Spinnaker first. I work for Netflix, and it's an open source Cloud Deployment Tool they developed in-house (http://www.spinnaker.io).
A lot of the features you built over the years already exist in tools like capistrano or fabric. Any particular reason why you didn't use them in the first place?
I've been doing an eval for a new environmental auditing tool at work, and I've found that most of the pre-built solutions out there (e.g. Ansible Tower, Chef Server, some tools that we have written internally) will mostly meet our needs with some coercion, but I decided we should write our own anyway because it gives us the flexibility to only use and maintain the features we're actually going to use.
It's very possible (likely, even) that the Reddit guys looked at fabric or capistrano, and decided either:
1) the tool didn't map to their model of deployments, or
2) the tool did too much and would require more maintenance than a dead-simple solution they wrote themselves.
> why did you choose to write your deployment tools from scratch, instead of going with something like Jenkins?
Each step along the way was basically just a small modification on the system before. AFAIK Jenkins doesn't come with the ability to safely deploy code to hundreds of servers out of the box, so building out the systems to make that the case would've been more work than just adding to what existed and for unknown benefit.
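To give a sense of what "safely deploy to hundreds of servers" might involve at its simplest, here is a sketch of a batched rollout that stops once too many hosts fail. It is not Reddit's tool; `rolling_deploy`, `deploy_host`, `batch_size` and `max_failures` are hypothetical names, and a real tool would also handle parallelism, retries and reporting.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rolling_deploy(
    hosts: Sequence[str],
    deploy_host: Callable[[str], bool],  # hypothetical: returns True on success
    batch_size: int = 10,
    max_failures: int = 0,
) -> Tuple[List[str], List[str]]:
    """Deploy batch by batch; stop before the next batch if failures exceed the limit."""
    succeeded: List[str] = []
    failed: List[str] = []
    for batch in batches(hosts, batch_size):
        for host in batch:
            (succeeded if deploy_host(host) else failed).append(host)
        if len(failed) > max_failures:
            break  # remaining hosts keep running the old code
    return succeeded, failed
```

Stopping between batches bounds how much of the fleet a bad build can reach, at the cost of a slower rollout.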
> And, how do/did you provision new servers? By hand, or did you use something like Chef/Puppet?