How We Saved Millions in SSD Costs by Upgrading Our Filesystem

James Katz
October 29, 2021 · 5 min read

At Heap, we use a multi-petabyte cluster of Postgres instances to support a wide variety of analytical queries for our customers. During COVID we experienced rapid growth in the amount of data we ingest, due to business growth and a global increase in online traffic. This post details some of the problems this caused us and how we solved them.

Performance problems due to rapid data growth

Earlier this year, we noticed that our database nodes were filling up faster than expected. As they passed 80% utilization, we started to experience performance degradation in our ingestion pipeline. We struggled to keep up with incoming data, particularly when other write-intensive workloads were running on the cluster (maintenance operations, index builds, etc.).

The root cause of our issues was that SSDs tend to experience performance degradation as utilization approaches 100%. For us, this issue is exacerbated by the fact that we use the ZFS filesystem. ZFS is a copy-on-write (CoW) filesystem, which means that whenever a block on disk is updated, ZFS will write a new copy of the block, instead of updating it in place.

[Animation: copy-on-write block updates]

CoW semantics facilitate many interesting features, including:

  • Filesystem-level compression - Compression is harder to implement in traditional update-in-place filesystems, since a block's compressed size can change whenever it is modified. We get 4-5x compression (more on this later), which saves us petabytes of storage.

  • Higher durability - Because a block is never overwritten in place, torn pages are impossible after a crash, which lets us disable full_page_writes in Postgres. This improves write performance and decreases overall IO.

  • Consistent point-in-time snapshots - Since pages are effectively immutable, the filesystem can hold onto “out of date” pages during snapshot operations to guarantee consistency. Traditional filesystems have no such functionality,* and tools like rsync cannot take point-in-time snapshots.
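To make these concrete: all three of the features above surface through standard ZFS tooling. A minimal sketch (the dataset name tank/pgdata is hypothetical; the last line is a postgresql.conf setting, shown for context):

    # Inspect the compression algorithm, achieved ratio, and record size
    zfs get compression,compressratio,recordsize tank/pgdata

    # Take a consistent point-in-time snapshot
    zfs snapshot tank/pgdata@before-upgrade

    # postgresql.conf: safe to disable on ZFS, since torn pages cannot occur
    full_page_writes = off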

However, CoW filesystems like ZFS can experience worse-than-normal write performance degradation as drives fill up. Because the allocator must find an empty block for every page update, the penalty at high utilization can be severe: the filesystem has to free previously unlinked blocks and shuffle existing ones around to find space. The effect is exacerbated for our use case because we set the record size relatively high, at 64 KB, to obtain a better compression ratio. For these reasons, it's generally recommended not to let ZFS go past 80% disk utilization.
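Pool utilization is easy to keep an eye on with standard tooling; a sketch (the pool name tank is hypothetical):

    # CAP shows the percentage of pool capacity in use; alert well before 80%
    zpool list -o name,size,alloc,cap tank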

At this point we knew we had to do something to decrease disk utilization across the cluster. Ordinarily we'd add more nodes. In this case, we first wanted to try a simpler idea we'd investigated earlier in the year, in the hope of avoiding an increase in our AWS bill.

A possible compression-based solution

Earlier in 2021, we experimented with upgrading to ZFS 2.x so we could use Zstandard compression. We were using lz4 at the time, which is incredibly fast and gave us a compression ratio of ~4.4x. Our experiments showed that Zstandard could raise that ratio to ~5.5x. Since the same data at 5.5x occupies 4.4/5.5 = 80% of the space it does at 4.4x, that would have meant a ~20% decrease in disk usage: a massive boon for our disk utilization issues, not to mention a significant long-term reduction in our AWS bill.
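On ZFS 2.x, switching algorithms is a one-line property change. Note that it only applies to newly written blocks, so existing data has to be rewritten (for example via a backup restore, as we ultimately did) before the new ratio shows up. A sketch, again with a hypothetical dataset name:

    # Requires OpenZFS >= 2.0; plain "zstd" uses level 3 by default
    zfs set compression=zstd tank/pgdata

    # Confirm what new writes will use
    zfs get compression tank/pgdata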

However, we had some performance concerns at the time. Early query performance comparisons between the two configurations were inconclusive. Zstandard is slower than lz4 in most benchmarks,** so we wanted to be sure we wouldn't slow down queries for our customers by rolling this out. We decided to proceed with a rigorous test phase, with the goal of determining whether we could safely roll out Zstandard compression without causing a performance regression.

Testing prior to rollout

Changing to Zstandard compression affects read performance in two ways:

  • Less IO - Each ZFS record (sized by the recordsize property) compresses into fewer hardware blocks (whose size is set by ashift), so reading a record requires less IO; rough numbers follow this list.

  • Slower decompression - Zstandard is slower than lz4, so decompressing a given record generally takes longer.
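As a rough illustration of the first effect, assuming our 64 KB record size and 4 KB hardware blocks (ashift=12):

    64 KB / 4.4x ≈ 14.5 KB  →  4 blocks of 4 KB per record (lz4)
    64 KB / 5.5x ≈ 11.6 KB  →  3 blocks of 4 KB per record (Zstandard)

So each record read touches roughly 25% fewer blocks, which has to be weighed against the slower decompression.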

It was difficult to confidently predict the combined impact of these effects on our production workload. The net impact depended on the aggregate query workload across our customers, as well as various other sources of contention in our production environment. We therefore decided to use our infrastructure for testing database performance changes (described in this blog post).

With this infrastructure, we restored a backup onto a control machine using our existing configuration and onto a test machine running ZFS 2.x with Zstandard compression enabled. These machines handled a sample of roughly 1% of our full production dataset and read/write workload. After mirroring production writes and reads for several days, we looked at the data and found that, compared to the control machine, query performance was essentially unchanged, storage utilization decreased by ~20% as expected, and average write duration was halved. We were ready to roll this out.

Results

Our database cluster is broken up into cells of five nodes, each within an autoscaling group (ASG). Replacing a node is as simple as detaching it from its ASG. The ASG then provisions a new node, which automatically restores the latest backup from the detached node and enters warm standby mode. At this point we promote the new node to primary and destroy the old node.
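Mechanically, a replacement boils down to a couple of operations; a hedged sketch assuming the AWS CLI and Postgres 12+, with hypothetical instance, group, and host names:

    # Detach without shrinking the group, so the ASG provisions a replacement
    aws autoscaling detach-instances \
      --instance-ids i-0123456789abcdef0 \
      --auto-scaling-group-name pg-cell-1 \
      --no-should-decrement-desired-capacity

    # Once the new node has restored the backup and caught up, promote it
    psql -h new-node -c "SELECT pg_promote();"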

To roll out this change we created a new AMI for our database instances with the new configuration baked in. Then, we executed a rolling replacement of all nodes via backup restore using the above mechanism. When all was said and done we observed the following impact:

  • Total storage usage reduced by ~21% (for our dataset, this is on the order of petabytes)

  • Average write operation duration decreased by 50% on our fullest machines

  • No observable query performance effects

This exactly matched the experimental results from our database test environment.

This project exemplified how, at scale, small configuration changes can have a massive impact. It also showed the benefit of building on free open-source software: someone else had already done most of the hard work, and we got the win largely for free. Rolling out the change took real effort, but compared to the payoff of shrinking the entire Heap dataset by ~20%, it was low-hanging fruit.

If you find this kind of work interesting, we’re hiring! Check out our engineering team and open roles.

*It’s possible to obtain this functionality with xfs/ext4 by using LVM (Logical Volume Manager). LVM’s snapshot capability works similarly to ZFS: it monitors changes to on-disk blocks and copies the previous contents to a snapshot volume whenever a block changes.
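For example (the volume group and volume names are hypothetical):

    # Reserve 50G of copy-on-write space for changed blocks; unchanged
    # blocks are read through from the origin volume
    lvcreate --snapshot --name pgdata-snap --size 50G /dev/vg0/pgdata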

**See Facebook’s benchmarks of various compression algorithms.
