At Heap, we lean on PostgreSQL for most of the backend heavy lifting. We store each event as an hstore blob, and we keep a PostgreSQL array of events done by each user we track, sorted by time. Hstore lets us attach properties to events in a flexible way, and arrays of events give us great performance, especially for funnel queries, in which we compute the drop off between different steps in a conversion funnel.
In this post, we’ll take a look at a PostgreSQL function that hangs on large inputs and rewrite it in an efficient, idiomatic way.
If you’re writing a PL/pgSQL function that returns an array, it can be tempting to create an empty array and build up results in a loop with
array_cat. But, as is often the case, procedural idioms in a relational database result in bad times.
Consider an example function in which we create an array of hstores such that the entry at position
Looks simple enough, but this is bad news. This takes a quadratic amount of time to run, blowing up to 36 seconds to generate an array of 100k elements.
What’s going on here? It turns out the repeated calls to
array_append cause this quadratic explosion. When we call
result := array_append(result, ...), PostgreSQL allocates a new array that’s wide enough for the result of the
array_append call and then copies the data in. That is,
array_append(array, new_element) is linear in the length of
array, which makes the implementation above O(N2).
A lot of languages handle this idiom more gracefully. A common strategy is to resize the array that backs a list to double the existing size. With a list written this way, repeated appends would only require us to execute the “grow the array and copy over your data” operation a logarithmic number of times, and our amortized runtime would be linear.
So, PostgreSQL could be smarter, but this is not an idiomatic implementation, so we shouldn’t expect it to be. The correct way to do this is with
array_agg — an aggregate function that takes a set and returns all of the entries as an array.
This is fast, and it scales linearly. It takes 300 ms to generate an array of 100k elements — a 100x reduction. This query allows PostgreSQL to generate the complete set of results before materializing any arrays, so it doesn’t need to do so more than once. PostgreSQL can compute upfront exactly how long the resulting array needs to be and only needs to do one allocation.
Lesson learned: if you find yourself calling array_append or array_cat in a loop, use array_agg instead.
When you’re working with any relational database, you should reconsider any procedural code you find yourself writing. Also, it helps to have an intuition for how long something “should” take. Generating a 100k element array (with around one megabyte of total data) shouldn’t take thirty seconds, and, indeed, it doesn’t need to.
We like learning stuff. Have any feedback or other PostgreSQL tips? Shoot us a note @heap.
 In particular, we use a lovely tool called Citus Data. More on that in another blog post!
 See: https://heap.io/features/funnels. In particular, computing a conversion funnel requires a single scan over the array of events a user has done and doesn’t require any joins.