The Evolution of Code Deploys at Reddit

siliconc0w · on June 3, 2017

It's nice to have a lot of options when managing deployments: feature flags, canary deploys, A/B or blue/green, etc. Canaries are really nice for catching the majority of deploy issues (e.g., 'it doesn't start up' errors, obvious exception spikes, or performance regressions). Feature flags let you hand off control to individual product teams and encourage them to create 'bite sized' changes which can be flagged. Blue/green is also a nice way to reduce risk for more complicated cross-service changes, where having another 'known good' copy of production around is helpful.
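
To make the feature-flag idea concrete, here is a minimal sketch of a percentage rollout. It assumes a flag is identified by a name and that each user has a stable ID; the function names `bucket` and `is_enabled` are hypothetical and are not taken from Reddit's feature system.

```python
import hashlib


def bucket(flag_name: str, user_id: str, buckets: int = 100) -> int:
    """Map a (flag, user) pair to a stable bucket in [0, buckets)."""
    # Hashing flag and user together keeps each user's bucket stable for a
    # given flag, while different flags slice the user base differently.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def is_enabled(flag_name: str, user_id: str, percent: int) -> bool:
    """Return True if the flag is on for this user at the given rollout percent."""
    # percent=0 turns the flag off for everyone, percent=100 turns it on for all.
    return bucket(flag_name, user_id) < percent
```

Because the bucket is derived from a hash rather than a random draw, raising the percentage only ever adds users to the enabled group, so a team can ramp a change up gradually without users flickering in and out.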

spladug · on June 3, 2017

Feature flags are super useful. We didn't really have room to cover our use of them in this post, but if you're curious you can see the system we currently use for it here: https://github.com/reddit/reddit/blob/master/r2/r2/config/fe...

technion · on June 3, 2017

It also says a lot about how basic deployment schemes are perfectly good at a reasonable scale.

I've seen many projects in the 1-2 server range become messy nightmares as people obsessed over 2017-Reddit-style deployments.

spladug · on June 3, 2017

Keep it simple and boring as long as you can, and even then go with the boringest solution to your problems.

wyc · on June 3, 2017

Kudos to the Reddit engineering team. I've been really enjoying their quality posts as of late, and I wish more companies of that scale were as transparent with their engineering problems and solutions. Thank you.

artursapek · on June 3, 2017

Great write-up. I liked the stage-by-stage structure of this.

I built a tool for the team to coordinate deploy locks at my last job, similar to the IRC bot described here. People seemed to like it a lot over the previous system of just shouting at the rest of the room that they're taking prod.
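
As a rough illustration of what coordinating a deploy lock involves, here is a minimal in-memory sketch. The `DeployLock` class and its methods are hypothetical, and it is not the tool described in this comment or Reddit's IRC bot; a real version would need shared storage and a chat front end.

```python
from typing import Optional


class DeployLock:
    """Toy deploy lock: at most one person holds production at a time."""

    def __init__(self) -> None:
        self._holder: Optional[str] = None

    @property
    def holder(self) -> Optional[str]:
        """Name of the current lock holder, or None if the lock is free."""
        return self._holder

    def acquire(self, user: str) -> bool:
        """Take the lock if it is free; re-acquiring your own lock succeeds."""
        if self._holder is None:
            self._holder = user
            return True
        return self._holder == user

    def release(self, user: str) -> bool:
        """Release the lock; only the current holder is allowed to do so."""
        if self._holder == user:
            self._holder = None
            return True
        return False
```

The point of the explicit holder is that anyone can ask who currently has production, instead of relying on whoever happened to hear the announcement.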

eriknstr · on June 3, 2017

I think it's strange how little attention this got here on HN. Submitted by one of the Reddit admins even. I'd ask you to do an AMA but I have no questions and besides you lot are pretty good at answering questions whenever they come up anyway so.

spladug · on June 3, 2017

Thanks! Do feel free to ask if you think of something :)

pacaro · on June 3, 2017

How do you handle one-way gates? Clearly at each deploy there are two different versions running concurrently, and as you make changes you do so knowing that, but as the system evolves there are points in (code) time that you can't go back to. Is this not a concern because you would never roll back that far?

spladug · on June 3, 2017

Yeah, rollbacks are more of adding a revert to the top of the pile so we just make sure we roll back things that can be rolled back. This is important to think about when planning deploys.

gusfoo · on June 3, 2017

> Yeah, rollbacks are more of adding a revert to the top of the pile

Neat. It reminds me of the method used in gaming nowadays, which is to just write a "savepoint" marker into the message stream instead of pausing the entire game to save state.

syncopate · on June 3, 2017

When you deploy: Will you lose requests that are currently being processed while you restart the service? Also, will a server still receive requests while the new code is being started?

spladug · on June 3, 2017

Generally: no. The worker processes will finish up their current request before shutting down and being reaped. At the whole server level, Einhorn will indeed make sure that requests are still being served as the workers get shuffled out.
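
The "finish the current request, then exit" behavior can be sketched as below. This is a simplified single-process model under stated assumptions: the `Worker` class is hypothetical, it does not use Einhorn's API, and in a real server `request_shutdown` would typically be triggered from a signal handler.

```python
from typing import Callable, Iterable, List


class Worker:
    """Toy worker that drains gracefully: it never abandons a request mid-flight."""

    def __init__(self, handler: Callable[[str], str]) -> None:
        self.handler = handler
        self.stopping = False
        self.handled: List[str] = []

    def request_shutdown(self) -> None:
        """Ask the worker to stop; takes effect only between requests."""
        self.stopping = True

    def run(self, requests: Iterable[str]) -> None:
        for request in requests:
            if self.stopping:
                break  # stop picking up new work once shutdown is requested
            # The in-flight request always runs to completion, even if
            # shutdown is requested while the handler is executing.
            self.handled.append(self.handler(request))
```

Requests the worker never picked up are not lost in the real setup, because the process manager keeps other workers serving on the same socket while old ones are shuffled out.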

eriknstr · on June 3, 2017

Actually I came to think of something but it's not an AMA kind of question but a Reddit question though. I just opted in to the profile beta literally less than an hour ago and I like it but I have a bug report [1] and a feature request / feedback [2] that I'd like to be seen by the specific people that work on the profile portion of Reddit. Who do I contact about that? Or was submitting it to /r/beta sufficient and they'll see it because I've posted it there?

[1]: https://www.reddit.com/r/beta/comments/6eyoy4/beta_bug_list_...

[2]: https://www.reddit.com/r/beta/comments/6eyqse/feedback_profi...

spladug · on June 3, 2017

Just checked: r/beta is the right place for that. Thanks for the feedback!

eriknstr · on June 3, 2017

I came to think of a couple of AMA kind of questions also.

When you are an employee of Reddit you obviously spend a lot of time on Reddit.

1. How does this affect your life outside of work. Do you find yourself browsing Reddit in your spare time or do you avoid it?

2. Do you ever find yourself spending whole days at work without being on the site because you don't want to be distracted while working on something?

spladug · on June 3, 2017

> Do you find yourself browsing Reddit in your spare time or do you avoid it?

I personally still spend plenty of time reading stuff on the site, both educational (r/askhistorians, various programming subreddits) and entertainment (r/popular). I'll definitely tend to avoid the more meta subreddits when not in a work mood though.

> Do you ever find yourself spending whole days at work without being on the site because you don't want to be distracted while working on something?

Oh yeah, it definitely happens: heads down in code, doing a complex series of deploys, or just in a bunch of meetings.

There's also the times where I go to check on the site to check on something I deployed and instead get distracted by something on the front page and forget what I was doing. Oops.

jameskegel · on June 3, 2017

I'd like to pose a follow up question; as your role at the company has progressed, do you feel your day to day (collective you, the [a]'s) is less development oriented and more social management of the different issues that arise when a community gets that large?

spladug · on June 3, 2017

Actually quite the opposite. One of the benefits of being a larger company is separation of concerns. Our community team is a bunch of fantastic people that are focused on that side of things and engineering spends most of its time writing code. Obviously there will be overlap from time to time, but it's definitely way less than when we were 10 people.

jensvdh · on June 3, 2017

Does Reddit have democratic deploys (i.e., every engineer is responsible for deploying their own changes into production), or do you guys have a system with release managers?

spladug · on June 3, 2017

Every engineer writes code, gets it reviewed, checks it in, and rolls it out to production regularly.

lapitopi · on June 3, 2017

If you're looking to build a deployment tool from scratch, please consider Spinnaker first. I work for Netflix, and it's an open source Cloud Deployment Tool they developed in-house (http://www.spinnaker.io).

It's excellent, I use it daily.

juanbrein · on June 3, 2017

A lot of the features you built over the years are built into tools like Capistrano or Fabric. Any particular reason why you didn't use them in the first place?

nameless912 · on June 3, 2017

Probably simple-is-usually-better.

I've been doing an eval for a new environmental auditing tool at work, and I've found that most of the pre-built solutions out there (e.g. Ansible Tower, Chef Server, some tools that we have written internally) will mostly meet our needs with some coercion, but I decided we should write our own anyway because it gives us the flexibility to only use and maintain the features we're actually going to use.

It's very possible (likely, even) that the Reddit guys looked at fabric or capistrano, and decided either:

1) the tool didn't map to their model of deployments, or 2) the tool did too much and would require more maintenance than a dead-simple solution they wrote themselves.

It's all a matter of perspective.

spladug · on June 3, 2017

Spot on.

mschuster91 · on June 3, 2017

A question out of curiosity: why did you choose to write your deployment tools from scratch, instead of going with something like Jenkins?

And, how do/did you provision new servers? By hand, or did you use something like Chef/Puppet?

spladug · on June 3, 2017

> why did you choose to write your deployment tools from scratch, instead of going with something like Jenkins?

Each step along the way was basically just a small modification on the system before. AFAIK Jenkins doesn't come with the ability to safely deploy code to hundreds of servers out of the box, so building out the systems to make that the case would've been more work than just adding to what existed and for unknown benefit.
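
For a sense of what "safely deploy code to hundreds of servers" can involve, here is a minimal sketch of a batched rollout that stops early when too many hosts fail. The `rolling_deploy` function, its parameters, and the host names are hypothetical; this is not Reddit's tool, and the actual per-host deploy step is left as a callback.

```python
from typing import Callable, List, Sequence


def rolling_deploy(
    hosts: Sequence[str],
    deploy_one: Callable[[str], bool],
    batch_size: int = 2,
    max_failures: int = 0,
) -> List[str]:
    """Deploy batch by batch; stop once failures exceed max_failures.

    Returns the hosts that were deployed successfully.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    succeeded: List[str] = []
    failures = 0
    for start in range(0, len(hosts), batch_size):
        # Finish the whole batch so no host in it is left half-processed.
        for host in hosts[start:start + batch_size]:
            if deploy_one(host):
                succeeded.append(host)
            else:
                failures += 1
        # Check the failure budget between batches; later hosts stay untouched.
        if failures > max_failures:
            break
    return succeeded
```

Keeping the batch small at the start gives a canary-like effect: a bad build fails on a couple of hosts and the rollout halts before it reaches the rest of the fleet.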

> And, how do/did you provision new servers? By hand, or did you use something like Chef/Puppet?

We use puppet for configuration management. That's been the case for most stuff since early 2011 for context. There's a lot more detail in our semi-recent infra/ops AMA: https://www.reddit.com/r/sysadmin/comments/57ien6/were_reddi...

oblio · on June 3, 2017

Jenkins would have bought you deployment queueing, which I see you developed.

On the other hand, since you weren't already using it for builds/running tests, it would have added some overhead.

letientai299 · on June 3, 2017

Off topic.

Given the quality of the article, I really wonder why most of the Reddit open source projects mentioned in it are not popular.

Is that because of a lack of marketing?