Hijacking a bit, but does anyone have any good resources/guides around managing Terraform state in larger organizations? Terraform Enterprise seems to address this, but I was wondering if there are workflows that allow subsections of infrastructure (think teams or systems) to be managed independently, without relying on a re-evaluation of the entire organization's assets. So far the only approach I've seen is keeping protected high-level resources (VPC, subnets, etc.) in a separate state and using terraform_remote_state to reference those.
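The shape of that pattern is roughly the following (assuming an S3 backend; the bucket, key, and output names here are made up):

    # Read-only reference to a protected, separately managed network state.
    # Bucket/key/output names are hypothetical.
    data "terraform_remote_state" "network" {
      backend = "s3"
      config = {
        bucket = "acme-terraform-state"
        key    = "network/terraform.tfstate"
        region = "us-east-1"
      }
    }

    # Consume its outputs like any other value.
    resource "aws_security_group" "app" {
      name   = "app"
      vpc_id = data.terraform_remote_state.network.outputs.vpc_id
    }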
There was a good talk at HashiConf a few years back; I can't find a link now.
I can't say that this is the "right" way to do it, but it scales OK for our org (100+ engineers, 200+ services per environment, many third-party and in-house providers). A single team owns the codebase, though all backend engineers are expected to write and maintain their own infra code. Some highlights:
* Monorepo with ~700 Terraform modules and >1000 Terraform workspaces
* CI/CD tooling to work out the graph of workspaces that need to be executed for a given change (using Merkle trees/fingerprinting) and to build dynamic plan/apply pipelines for a given PR
* Strict requirements that master be up to date, serialised merges managed by a bot, and continuous deployment (i.e. apply) on merge
* Templated code/PR generation for common tasks
* Tooling for state moves, lock management, etc.
* Daily full application of all workspaces, with alerting on state drift (e.g. externally updated resources)
As an org, we average about 20 infrastructure changes per day through this system.
A few tips:
* Find the right level of abstraction for breaking larger workspaces down into smaller ones. This should be determined by things like rate of change, security requirements, team ownership, and in some cases whether you have a flaky provider that you want to isolate. Size of state should also be a consideration - if a workspace takes 2 minutes to plan, it's too big
* If you start to use lots of remote states, wrap them all into a module with a sane interface to make consumption easier (see the sketch after this list). You can also embed rules in this like "workspace X cannot consume from state Y" (e.g. because of circular dependencies or security considerations)
* Never embed a provider within a module (I think this is enforced in newer TF versions)
* Terraform is a hammer that can tackle most nails, but for several problems there are more appropriate tools
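To make the remote-state wrapper tip concrete, here is a minimal sketch of what such a module could look like (the state bucket/keys and output names are made up, not a prescription):

    # modules/remote-states/main.tf: the one place that knows where shared
    # states live; consumers call this module instead of wiring up
    # terraform_remote_state themselves.
    variable "environment" {
      type = string
    }

    data "terraform_remote_state" "network" {
      backend = "s3"
      config = {
        bucket = "acme-terraform-state"                         # hypothetical
        key    = "${var.environment}/network/terraform.tfstate" # hypothetical
        region = "us-east-1"
      }
    }

    output "vpc_id" {
      value = data.terraform_remote_state.network.outputs.vpc_id
    }

    output "private_subnet_ids" {
      value = data.terraform_remote_state.network.outputs.private_subnet_ids
    }

    # In a consuming workspace:
    #   module "shared" {
    #     source      = "../../modules/remote-states"
    #     environment = "production"
    #   }
    #   ...then reference module.shared.vpc_id etc.

Rules like "workspace X cannot consume from state Y" can then live in this module (or in CI) rather than being scattered across consumers.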
If you don't have a CI/CD system that already allows you to deploy 100 changes a day, you can't do large-scale monorepos; you'll get caught up in continuous integration hell.
In that case, the best choice is lots of remote states / data sources, independent modules in independent repos that reuse other modules, and strict adherence to internal conventions, including branching/naming/versioning standards running the gamut from your VCS, to the module, to the code, to the data structures, to the "terraformcontrol" repos, etc. Basically, standardize every single possible thing. If anyone ever needs something to work differently, update the standard before the individual module.
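On the versioning side of that, the usual trick is pinning every cross-repo module reference to an explicit ref, so upgrades are deliberate, reviewable changes (the repo URL, tag, and inputs below are made up):

    # Consume a shared module from its own repo, pinned to a released tag.
    module "vpc" {
      source = "git::https://github.com/acme/terraform-aws-vpc.git?ref=v1.4.0"

      # Inputs are whatever the module defines; these are placeholders.
      name = "prod-main"
      cidr = "10.20.0.0/16"
    }

Bumping the ref is then a one-line, easily reviewed diff per consumer.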
How and when to separate remote states is still a bit of black magic. In general, you can make a new state for each complete unit of deployment. Assuming a deployment has stages, you can separate Terraform state into those different stages, so that you can step through them, applying as you go and stopping if you detect a problem. The biggest mishap is when you're trying to apply 100 changes, your apply fails halfway through, and you have to stop the world to manually fix it or revert, which may not even work. It's much easier to manage a change that affects a few resources than lots of them.
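A minimal sketch of "a state per stage" with an S3 backend (names made up): each stage directory carries its own backend key, so you can plan/apply stages one at a time and stop when something looks wrong.

    # stages/10-network/backend.tf (hypothetical layout)
    terraform {
      backend "s3" {
        bucket         = "acme-terraform-state"
        key            = "production/10-network/terraform.tfstate"
        region         = "us-east-1"
        dynamodb_table = "terraform-locks"
      }
    }

    # stages/20-services/backend.tf is identical except for its own key, e.g.
    #   key = "production/20-services/terraform.tfstate"
    # so a failed apply in one stage doesn't hold the others hostage.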
I really wish there were more in-depth articles and tutorials around this topic, as it was a pretty big pain point when I started out. It's been a little over a year now since we started using Terraform for our AWS infrastructure, and here is how we set it up:
- /modules holds a bunch of common modules. As an example, we have an aws_application module to set up application configuration in Systems Manager and an ECR repository for the Docker image.
And then we have our workspaces folder (which does not actually use Terraform's concept of workspaces). It goes down in specificity and uses terraform_remote_state for values from the previous workspaces. Our CI runs Terraform automatically in this order:
- /workspaces/aws sets up our main AWS account. No applications actually run here; it's just IAM setup and some random configuration values.
- /workspaces/production sets up a lot of the backend. We aren't a big company, so we don't have to deal with cross-region databases or Redis clusters. This is also where we would call the aws_application module and set up an ECR repo (roughly sketched after this list).
- /workspaces/production-us-east-2 is where we set up the ECS cluster, task definitions, load balancers, and DNS routes. We're small, so we only have one region, but the idea is that we could copy this folder to another region and scale horizontally super easily.
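For illustration, the production workspace's use of the account-level outputs and the aws_application module looks roughly like this (the bucket/key and the module's inputs are made up, not our exact code):

    # workspaces/production/main.tf (sketch; names and inputs are hypothetical)
    data "terraform_remote_state" "aws" {
      backend = "s3"
      config = {
        bucket = "acme-terraform-state"
        key    = "aws/terraform.tfstate"
        region = "us-east-2"
      }
    }

    module "billing_app" {
      source = "../../modules/aws_application"

      name        = "billing"
      environment = "production"
      # e.g. something exported by the account-level workspace
      deploy_role_arn = data.terraform_remote_state.aws.outputs.deploy_role_arn
    }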
Then we have the same folders for our staging setup, only with some scale values and configuration tweaked.
Overall I'm pretty happy with this solution. It keeps any issues from spreading far, thanks to the different state files. It can get pretty ugly, though, if you need to use some region information in the environment workspace (production-us-east-2 values in production). I also can't comment on how well it scales out past what we have done so far.
The way we do it is:
1) Each team owns one or more AWS sub-accounts (e.g. a particular app or function will be in its own account)
2) An internal version of this is used to establish and enforce company-wide standards: https://github.com/cloud-custodian/cloud-custodian
3) A repository of Terraform modules is shared amongst teams to standardize how common AWS resources are used (e.g. enforce X, Y, and Z for S3 buckets; see the sketch below)
This way, the per account setup (represented as a repo) is relatively small, common patterns are standardized, and there is still room for experimentation.
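As a (made-up) illustration of point 3: a shared bucket module can bake the mandatory settings in, so teams only pick a name and the standards come for free.

    # modules/standard-s3-bucket/main.tf (hypothetical; the enforced rules
    # shown here are just examples)
    variable "name" {
      type = string
    }

    resource "aws_s3_bucket" "this" {
      bucket = var.name
    }

    # Standards baked in: no public access, versioning on, SSE enforced.
    resource "aws_s3_bucket_public_access_block" "this" {
      bucket                  = aws_s3_bucket.this.id
      block_public_acls       = true
      block_public_policy     = true
      ignore_public_acls      = true
      restrict_public_buckets = true
    }

    resource "aws_s3_bucket_versioning" "this" {
      bucket = aws_s3_bucket.this.id
      versioning_configuration {
        status = "Enabled"
      }
    }

    resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
      bucket = aws_s3_bucket.this.id
      rule {
        apply_server_side_encryption_by_default {
          sse_algorithm = "aws:kms"
        }
      }
    }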
When it was up to me, I managed things with small projects, rigorous standardization of naming and tagging, and a shared “metadata” module that could look up all the details you needed based on region, account (from the provider) and VPC name. Takes some discipline but makes for much more efficient Terraform, IMO.
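A minimal sketch of what such a metadata module might look like, assuming the naming/tagging discipline is in place (the tag conventions and outputs here are made up):

    # modules/metadata/main.tf (hypothetical)
    variable "vpc_name" {
      type = string
    }

    # Region and account come from the configured provider, not from inputs.
    data "aws_region" "current" {}
    data "aws_caller_identity" "current" {}

    # Look the VPC up by its Name tag, then derive everything else from it.
    data "aws_vpc" "this" {
      tags = {
        Name = var.vpc_name
      }
    }

    data "aws_subnets" "private" {
      filter {
        name   = "vpc-id"
        values = [data.aws_vpc.this.id]
      }
      tags = {
        Tier = "private" # hypothetical tagging convention
      }
    }

    output "account_id" {
      value = data.aws_caller_identity.current.account_id
    }

    output "region" {
      value = data.aws_region.current.name
    }

    output "vpc_id" {
      value = data.aws_vpc.this.id
    }

    output "private_subnet_ids" {
      value = data.aws_subnets.private.ids
    }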