Heap’s infrastructure runs on AWS, and we manage it using Terraform. This post is a collection of tips and gotchas we’ve picked up along the way.
Terraform and infrastructure as code
Terraform is a tool from Hashicorp to help manage infrastructure declaratively. Instead of manually creating instances, networks, and so on in your cloud provider’s console or command line client, you write configuration that describes what you want your infrastructure to look like.
This configuration is in a human-readable text format. When you want to modify your infrastructure, you modify the configuration and run terraform apply. Terraform will make API calls to your cloud provider to bring the infrastructure in line with what's defined in the configuration.
Moving our infrastructure management into text files allows us to take all our favorite tools and processes for source code and apply them to our infrastructure. Now infrastructure can live in source control, we can review it just like source code, and we can often roll back to an earlier state if something goes wrong.
As an example, here’s a Terraform definition of an EC2 instance with an EBS volume:
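Something along these lines (the AMI ID, instance type, and volume size here are placeholders):

```hcl
# Illustrative sketch: AMI ID, instance type, zone, and size are placeholders.
resource "aws_instance" "example" {
  ami               = "ami-0123456789abcdef0"
  instance_type     = "m4.xlarge"
  availability_zone = "us-east-1a"

  # The embedded block tells Terraform to create and attach an EBS volume
  # as part of creating the instance.
  ebs_block_device {
    device_name = "/dev/xvdf"
    volume_type = "gp2"
    volume_size = 100
  }
}
```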
If you haven’t tried Terraform yet, the getting started guide is quite good, and will quickly get you familiar with the workflow.
Terraform’s data model
At a high level Terraform has a simple data model: it manages resources, and resources have attributes. A few examples from the AWS world:
an EC2 instance is a resource with attributes like the machine type, boot image, availability zone, security groups
an EBS volume is a resource with attributes like volume size, volume type, IOPS
an Elastic Load Balancer is a resource with attributes for its backing instances, how it checks their health, and a few others
Terraform maintains a mapping between resources defined in configuration and the corresponding cloud provider resources. This mapping is called the state, and it's a giant JSON file. When you run terraform apply, Terraform refreshes its state by querying the cloud provider, then compares the returned resources against what you have in your Terraform configuration. If there are any differences, it creates a plan: a set of changes to the resources in your cloud provider to match your configuration. Finally, it applies those changes by making calls to your cloud provider.
Not every Terraform resource is an AWS resource
This resources-and-attributes data model is not too hard to understand, but it doesn’t necessarily match the cloud provider APIs perfectly. In fact, a single Terraform resource can correspond to one, more than one, or even zero underlying entities in your cloud provider. Here are some examples from AWS:
aws_ebs_volume corresponds to one AWS resource: the EBS volume
aws_instance with an embedded ebs_block_device block, as in the example above, corresponds to two EC2 resources: the instance and the volume
aws_volume_attachment corresponds to zero entities in EC2!
The last one might be surprising. When you create an aws_volume_attachment, Terraform makes an AttachVolume request; when you destroy it, it makes a DetachVolume request. There's no EC2 object involved: Terraform's aws_volume_attachment is completely synthetic! Like all resources in Terraform, it has an ID. But where most resources have an ID that comes from the cloud provider, the aws_volume_attachment's ID is simply a hash of the volume ID, instance ID, and device name.
Synthetic resources show up in a few other places in Terraform, for example aws_security_group_rule. One way to spot them is to look for attachment in the resource name, though that heuristic doesn't catch all of them.
There’s more than one way to do it, so choose carefully!
With Terraform, there can be more than one way to represent exactly the same infrastructure. Here's another way to represent our example instance with an EBS volume that results in the same EC2 resources:
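A sketch of that representation (the AMI ID, instance type, and volume size are placeholders):

```hcl
# The same instance and volume, now as three Terraform resources.
resource "aws_instance" "example" {
  ami               = "ami-0123456789abcdef0"
  instance_type     = "m4.xlarge"
  availability_zone = "us-east-1a"
}

resource "aws_ebs_volume" "example" {
  availability_zone = "us-east-1a"
  type              = "gp2"
  size              = 100
}

# Synthetic resource: creating it attaches the volume, destroying it detaches.
resource "aws_volume_attachment" "example" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.example.id}"
  volume_id   = "${aws_ebs_volume.example.id}"
}
```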
Now the EBS volume is a Terraform resource in its own right, distinct from the EC2 instance. There's also a third, synthetic resource that ties the two together. Representing our instance and volume this way allows us to add and remove volumes by adding and removing pairs of aws_ebs_volume and aws_volume_attachment resources.
In many cases, it doesn’t matter which EBS representation you choose. But sometimes making the wrong choice can make changing your infrastructure quite difficult!
We made the wrong choice
We got bitten by this at Heap. We operate a large PostgreSQL cluster in AWS, and each instance has 18 EBS volumes attached for storage. We represented each instance in Terraform as a single aws_instance resource with the EBS volumes defined in embedded ebs_block_device blocks.
Our database instances store data on a ZFS filesystem. ZFS lets you dynamically add block devices to grow the filesystem with no downtime, which means we can gradually grow our storage as our customers send us more data. As an analytics company that captures everything, this flexibility is a huge win. We're continually improving the insert and query efficiency of our cluster. Instead of being stuck with the CPU-to-storage ratio we picked when we provisioned the cluster, we can adjust the balance on the fly to take advantage of the latest improvements. We'll go into more detail on how this works in another post.
Unfortunately, the embedded ebs_block_device blocks got in the way of this process being as smooth as it could be. You might hope that Terraform would let us add a nineteenth ebs_block_device block to the aws_instance and everything would just work.
But unfortunately, Terraform sees this as an incompatible change: it doesn’t know how to modify an instance with 18 volumes to turn it into one with 19 volumes. Instead the Terraform plan is to tear down the whole instance and create a new one in its place. This definitely isn’t what we want for our database instances with tens of terabytes of storage!
Until recently, we worked around this and hackishly got Terraform back in sync in three steps:
1. we ran a script that used the AWS CLI to create and attach the volumes
2. we ran terraform refresh to get Terraform to update its state, and
3. finally we changed the configuration to match the new reality
Between steps 2 and 3, terraform plan would show that Terraform wanted to destroy and recreate all our database instances. This made it impossible to do anything with those instances in Terraform until someone updated the config. Needless to say, this is a scary state to end up in routinely!
Terraform state surgery
Once we found the aws_volume_attachment approach, we decided to switch our representation over. Each volume became two new Terraform resources: an aws_ebs_volume and an aws_volume_attachment. With 18 volumes per instance across our cluster, we were looking at well over a thousand new resources.

Switching the representation isn't just a matter of changing the Terraform configuration. We also had to reach into Terraform's state to change how it sees the resources. With over a thousand resources being added, we were definitely not going to do it manually.

Terraform's state is stored as JSON. While the format is stable, the docs state that "direct file editing of the state is discouraged." We had to do it anyway, but we wanted to be sure we were doing it correctly. Rather than reverse-engineer the JSON format by inspection, we wrote a program that uses Terraform's internals as a library to read, modify, and write the state. This wasn't exactly straightforward, especially since it was the first Go program for both of the people working on it! But we think it was worth it to be sure we weren't subtly messing up the Terraform state of our database instances.

We've put the tool up on GitHub in case you find yourself in the same position!
Making terraform apply safer
terraform apply is one of the few times you have the power to seriously damage your company's infrastructure. There are a few things you can do to make this safer and less scary.
Always write your plan with -out, and apply that plan
If you run terraform plan -out planfile, Terraform will write the plan to planfile. You can then run exactly that plan with terraform apply planfile. That way, the changes made at apply time are exactly what Terraform showed you at plan time, and you won't find yourself unexpectedly changing infrastructure that a coworker modified in between your plan and apply.
Take care with the plan file though: it will include your Terraform variables, so if you put secrets in those they will be written to the filesystem in the clear. For example, if you pass in your cloud provider credentials as variables, those will end up stored on disk in plaintext.
Have a read-only IAM role for iterating on changes
When you run terraform plan, Terraform refreshes its view of your infrastructure. To do this it only needs read access to your cloud provider. By using a read-only role, you can iterate on your config changes and verify them with terraform plan without ever risking a stray apply ruining your day, or your week!
With AWS, we can manage the IAM roles and their permissions in Terraform. Our role looks like this:
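A sketch of such a role (the role name and the account and user ARNs are placeholders):

```hcl
# Role that trusted users can assume to get read-only access.
# The principal ARN below is illustrative.
resource "aws_iam_role" "terraform_readonly" {
  name = "terraform-readonly"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/alice"
      }
    }
  ]
}
EOF
}
```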
assume_role_policy simply lists the users who are allowed to assume the role. The final piece of this is the policy that gives read-only access to all AWS resources. Amazon helpfully provides a copy-pastable policy document, and that's what we used. We define an aws_iam_policy that references the policy document:
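An abbreviated sketch; the real read-only policy document from Amazon covers Describe, List, and Get actions across every service, which is far too long to reproduce here:

```hcl
# Abbreviated read-only policy; Amazon's full document lists many more actions.
resource "aws_iam_policy" "terraform_readonly" {
  name = "terraform-readonly"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "elasticloadbalancing:Describe*",
        "iam:Get*",
        "iam:List*"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}
```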
Then we attach the read-only policy to the terraform-readonly role:
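One way to do this, assuming the role and policy are defined as aws_iam_role.terraform_readonly and aws_iam_policy.terraform_readonly (illustrative names):

```hcl
# Attach the read-only policy to the read-only role.
resource "aws_iam_role_policy_attachment" "terraform_readonly" {
  role       = "${aws_iam_role.terraform_readonly.name}"
  policy_arn = "${aws_iam_policy.terraform_readonly.arn}"
}
```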
Now you can use the Security Token Service API's AssumeRole method to get temporary credentials that only have the power to query AWS, not change it. Running terraform plan will update your Terraform state to reflect the current infrastructure. If you're using local state, this means it will write to the terraform.tfstate file. If you're using remote state, e.g. in S3, you'll need to grant your read-only role write access to the state's storage.
Having this role in place made us much happier when rewriting Terraform's state to use aws_volume_attachment for our database volumes. We knew there should be no change to the infrastructure in AWS, only in Terraform's view of it. With the read-only role, we could iterate on the state rewrite with no risk of touching the infrastructure itself. After all, we weren't actually modifying any infrastructure, so why have that power available?
Ideas for the future
As our team grows, more and more people are making changes to our infrastructure with Terraform. We want to make this easy and safe. Most outages are caused by human error and configuration changes, and applying Terraform changes is a terrifying mix of the two.
For example, with a tiny team, it's easy to be sure only one person is running Terraform at any given time. With a larger team, that becomes less of a guarantee and more of a hope. If two terraform apply runs happened at the same time, the result could be a horrible non-deterministic mess. While we're not using it just yet, Terraform 0.9 introduced state locking, making it possible to guarantee that only one terraform apply is happening at a time.
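With the S3 remote state backend, locking is backed by a DynamoDB table. A minimal sketch (bucket, key, region, and table names are placeholders; in Terraform 0.9 the attribute was named lock_table, later renamed dynamodb_table):

```hcl
# S3 remote state with locking via a DynamoDB table.
# All names here are illustrative.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
  }
}
```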
Another place where we're thinking about ease and safety is in reviewing infrastructure changes. Right now our review process involves copy/pasting terraform plan output as a comment on the review, and applying it manually once it's approved.
We're already using our continuous integration tool to validate the Terraform configuration. For now this just runs terraform validate, which checks for syntax errors. The next step we want to work towards is having our continuous integration run terraform plan and post the infrastructure changes as a comment in code review. The CI system would automatically run terraform apply when the change is approved. This removes a manual step, while also providing a more consistent audit trail of changes in the review comments. Terraform Enterprise has a feature like this, and we'll be taking a look at it.