At Heap, we use a multi-petabyte cluster of Postgres instances to support a wide variety of analytical queries for our customers. During COVID we experienced rapid growth in the amount of data we ingest, due to business growth and a global increase in online traffic. This post details some of the problems this caused us and how we solved them.
Performance problems due to rapid data growth
Earlier this year, we noticed that our database nodes were filling up faster than expected. As they passed 80% disk utilization, we started to experience performance degradation in our ingestion pipeline. We struggled to keep up with incoming data, particularly when other write-intensive workloads were running on the cluster (cluster maintenance operations, index builds, etc.).
The root cause of our issues was that SSDs tend to experience performance degradation as utilization approaches 100%. For us, this issue is exacerbated by the fact that we use the ZFS filesystem. ZFS is a copy-on-write (CoW) filesystem, which means that whenever a block on disk is updated, ZFS will write a new copy of the block, instead of updating it in place.
CoW semantics facilitate many interesting features, including:
Filesystem-level compression - Compression is difficult to retrofit onto update-in-place filesystems, since a block’s compressed size can change whenever it is modified. We get 4-5x compression (more on this later), which saves us petabytes in storage space.
Higher durability - Because a block is never partially overwritten, torn pages cannot occur after a crash, which lets us safely disable full_page_writes in Postgres. This improves write performance and decreases overall IO.
Consistent point-in-time snapshots - Since pages are effectively immutable, the filesystem can hold onto “out of date” pages during snapshot operations to guarantee consistency. Traditional filesystems have no such functionality,* and tools like rsync cannot take point-in-time snapshots.
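The snapshot mechanism above can be sketched in miniature. The following is a toy Python model, not how ZFS is actually implemented: a block store that never overwrites in place, so a snapshot is just a frozen copy of the logical-to-physical block map, and old blocks it references stay readable.

```python
# Toy model of copy-on-write snapshots (illustrative only; real ZFS
# manages on-disk block trees, checksums, and free-space maps).

class CowStore:
    def __init__(self):
        self.blocks = {}     # physical block id -> bytes
        self.block_map = {}  # logical block number -> physical block id
        self.next_id = 0

    def write(self, logical, data):
        # Never update in place: always allocate a fresh physical block.
        self.blocks[self.next_id] = data
        self.block_map[logical] = self.next_id
        self.next_id += 1

    def snapshot(self):
        # A snapshot is just a frozen copy of the block map; the physical
        # blocks it references are kept alive even after later writes.
        return dict(self.block_map)

    def read(self, logical, block_map=None):
        mapping = block_map if block_map is not None else self.block_map
        return self.blocks[mapping[logical]]

store = CowStore()
store.write(0, b"v1")
snap = store.snapshot()
store.write(0, b"v2")       # rewrites logical block 0 elsewhere on "disk"
print(store.read(0))        # current data: b'v2'
print(store.read(0, snap))  # the snapshot still sees b'v1'
```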
However, CoW filesystems like ZFS can experience worse-than-normal write performance degradation as drives fill up. Because every page update requires the block allocator to find a free block to write to, free space becomes scarce and fragmented as utilization climbs, and the allocator has to search harder and fall back to smaller, scattered extents. The penalty is exacerbated in our case because we set the ZFS recordsize relatively high, at 64 KB, to obtain a higher compression ratio. For these reasons, it’s generally recommended not to let ZFS go past 80% disk utilization.
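The link between record size and compression ratio is easy to demonstrate: when each record is compressed independently, as a filesystem does, larger records give the compressor more context and amortize per-record overhead. A quick sketch using Python’s zlib as a stand-in for lz4/zstd (which aren’t in the standard library), on made-up analytics-style rows:

```python
import zlib

# Semi-repetitive data, loosely shaped like rows of analytics events.
row = b'{"user_id": 12345, "event": "pageview", "path": "/pricing"}\n'
data = row * 20000  # ~1.2 MB of logical data

def compressed_size(data, record_size):
    # Compress each record independently, as a filesystem would.
    return sum(
        len(zlib.compress(data[i:i + record_size]))
        for i in range(0, len(data), record_size)
    )

small = compressed_size(data, 8 * 1024)    # 8 KB records
large = compressed_size(data, 64 * 1024)   # 64 KB records
# Larger records yield a better overall compression ratio:
print(len(data) / small, len(data) / large)
```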
At this point we knew we had to do something to decrease disk utilization across the cluster. Ordinarily we’d add more nodes to our cluster. In this case, we first wanted to try a simpler idea we’d investigated earlier in the year, in the hope of avoiding an increase in our AWS bill.
A possible compression-based solution
Earlier in 2021 we experimented with upgrading to ZFS 2.x so we could use Zstandard compression. We were using lz4 at the time, which is incredibly fast, and gave us a compression ratio of ~4.4x. Our experiments with Zstandard showed that we could increase our compression ratio to ~5.5x if needed. This amounts to a ~20% decrease in disk usage, which would have been a massive boon to our disk usage issues, not to mention a significant long-term reduction in our AWS bill.
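The arithmetic behind that estimate: on-disk bytes are logical bytes divided by the compression ratio, so moving from 4.4x to 5.5x shrinks the footprint by the ratio of the two.

```python
lz4_ratio = 4.4   # measured compression ratio with lz4
zstd_ratio = 5.5  # ratio observed in our Zstandard experiments

# On-disk size is (logical bytes / ratio), so the relative savings is:
savings = 1 - lz4_ratio / zstd_ratio
print(f"{savings:.0%}")  # -> 20%
```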
However, we had some performance concerns at the time. Early query performance comparisons between the two configurations were inconclusive. Zstandard is slower than lz4 in most benchmarks,** so we wanted to be sure we weren’t going to slow down queries for our customers by rolling this out. We decided to proceed with a rigorous test phase to determine whether we could safely roll out Zstandard compression without causing a performance regression.
Testing prior to rollout
Changing to Zstandard compression impacts read performance in a couple ways:
Less IO - Each ZFS record (size configured via ZFS recordsize) compresses down to fewer hardware blocks (size configured via ZFS ashift), so reading a record requires less physical IO.
Slower decompression - Zstandard is slower than lz4, so decompressing a given record generally takes longer.
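A back-of-envelope model shows why the net effect is hard to call: whether the IO savings outweighs the slower decompression depends on effective disk bandwidth, which varies with contention. The throughput figures below are illustrative assumptions roughly in line with public lz4/zstd benchmarks, not measurements from our cluster:

```python
RECORD = 64 * 1024  # logical record size (matches our 64 KB recordsize)

def read_us(ratio, decompress_gbps, disk_gbps):
    """Microseconds to read and decompress one record."""
    io = (RECORD / ratio) / (disk_gbps * 1e9) * 1e6   # fetch compressed bytes
    cpu = RECORD / (decompress_gbps * 1e9) * 1e6      # expand to logical size
    return io + cpu

# Assumed decompression speeds: lz4 ~4 GB/s, zstd ~1.5 GB/s.
for disk in (0.1, 1.0):  # GB/s of effective disk bandwidth
    lz4  = read_us(ratio=4.4, decompress_gbps=4.0, disk_gbps=disk)
    zstd = read_us(ratio=5.5, decompress_gbps=1.5, disk_gbps=disk)
    print(f"disk {disk} GB/s -> lz4 {lz4:.0f} us, zstd {zstd:.0f} us")
```

Under these assumptions, zstd comes out ahead when IO is the bottleneck (contended, slow disks) and behind when decompression CPU dominates, which is exactly why we needed to test against our real workload.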
It was difficult to predict the combined impact of these effects on our production workload confidently. The net impact depended on the aggregate query workload across our customers as well as various other sources of contention in our production environment. Thus we decided to use our infrastructure for testing database performance changes (described in this blog post).
With this infrastructure we restored a backup onto a control machine using our existing configuration and onto a test machine using ZFS 2.x with Zstandard compression enabled. These machines served a sample of roughly 1% of our full production dataset and read/write workloads. After mirroring production writes and reads for several days, we compared the two machines: query performance was essentially unchanged, storage utilization decreased by ~20% as expected, and average write query duration was halved. We were ready to roll this out.
Our database cluster is broken up into cells of five nodes, each within an autoscaling group (ASG). Replacing a node is as simple as detaching it from its ASG. The ASG then provisions a new node, which automatically restores the latest backup from the detached node and enters warm standby mode. At this point we promote the new node to primary and destroy the old node.
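The replacement flow can be sketched as a simple loop over a cell. This is an illustrative simulation of the process described above (the names are hypothetical), not our actual orchestration code:

```python
def replace_node(cell, old):
    """Replace one node: detach it from the ASG, let the ASG provision
    a replacement that restores the detached node's latest backup, then
    promote the new node and destroy the old one."""
    cell.remove(old)            # detach from the ASG
    new = f"{old}-replacement"  # ASG provisions a fresh node, which
                                # restores the backup, enters warm
                                # standby, and is promoted to primary
    cell.append(new)
    return new

cell = [f"node-{i}" for i in range(5)]
for node in list(cell):  # rolling replacement, one node at a time
    replace_node(cell, node)
print(cell)  # every node replaced; the cell is back to five nodes
```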
To roll out this change we created a new AMI for our database instances with the new configuration baked in. Then, we executed a rolling replacement of all nodes via backup restore using the above mechanism. When all was said and done we observed the following impact:
Total storage usage reduced by ~21% (for our dataset, this is on the order of petabytes)
Average write operation duration decreased by 50% on our fullest machines
No observable query performance effects
This exactly matched the experimental results from our database test environment.
This project exemplified how, at scale, small configuration changes can have a massive impact. It also showed the benefits of building on free open source software: with open source, you can get massive wins for free. Rolling out this change was a lot of work, but compared to the benefit of the whole Heap dataset getting 20% smaller, it was low-hanging fruit, and only because someone else had already done most of the work for us.
*It’s possible to use LVM (Logical Volume Manager) to obtain this functionality with xfs/ext4. LVM’s snapshot capability works similarly to ZFS, by monitoring changes to on-disk blocks and creating a copy of the previous contents to a snapshot volume whenever a block changes.
**See Facebook’s benchmarks of various compression algorithms.