Decrypting PgBouncer's Diagnostic Information
If you use Postgres at scale, at some point you’ll need a connection pooler. Postgres lets you configure a maximum number of concurrent queries via max_connections, but if you need to handle bursts of more than a few dozen connections at a time, you probably don’t want to provision a lot of connection slots to do so – this decreases the memory available per-connection and will lead to worse performance.
Instead, you typically use a connection pooler, which multiplexes a low number of underlying database connections and presents an interface that appears the same to a client. This way, your app can send bursts of thousands of queries at a time, and they’ll queue up in your connection pooler instead of swamping your database. A configuration in which a fixed, low number of database connections processes incoming queries will perform much better than one that allocates a huge number of connections that are only occasionally used.
We have been using PgBouncer as a connection pooler since 2014, and we love it. It’s easy to deploy and configure, and it presents an interface to database clients that is wire-compatible, so you can roll it out without changing your application code!
Monitoring PgBouncer
PgBouncer exposes some pseudo-tables for debugging and monitoring purposes. You can access the information they expose by connecting to a special pseudo-database and running diagnostic commands. One of the most important such diagnostics is SHOW POOLS
, which you can use to understand the statuses of the connection pools in your PgBouncer instance.
Unfortunately, the output of SHOW POOLS
can look somewhat cryptic, and this is aggravated by the lack of documentation about it. This was making it hard for us to debug some PgBouncer-related issues – there was even some mismatch between how different people on our team understood the same metrics. That’s why I decided to dive into the source code and try and make sense of the data provided.
Overview Of Pools
This is the output from the command SHOW POOLS
when executed on one of our database nodes:
What do all these numbers mean? What should we be looking for? Do any invariants hold between them at all times? Say, is cl\_active
< cl\_waiting
, whatever these numbers mean? These are the kinds of questions that we’ll try to answer in this post.
Let’s start with some basic definitions.
pool
– The docs never really define what a pool is, but you can think of it as a set of connections to the same database for a given user.client connection
– A client connection is a connection from some client to PgBouncer. In our case, a client could be our data ingestion code, a background job, or a customer query.server connection
– A server connection is a connection from PgBouncer to Postgres, meaning that each server connection in PgBouncer has a corresponding Postgres connection. This is not true for client connections.CL\_ACTIVE
vscl\_active
– as confusing as it might be, some of the terms in this post are case-sensitive since this is the way the PgBouncer code uses them. When you see lowercasecl\_active
, this refers to the number thatSHOW POOLS
reports (say, 2928 for the last pool in the screenshot).CL\_ACTIVE
, on the other hand, is a state that a connection can be in, defined in the[SocketState]
enum ininclude/bouncer.h
. While there is usually a one-to-one correspondence between the lowercase number and the uppercase state, that’s not always the case. (See the section oncl\_waiting
for an example.)
So client connections are pretty lightweight — almost free aside from a file descriptor. Server connections are heavy and need to be configured appropriately. Supporting lots of client connections with relatively few server connections is our main use case for PgBouncer.
Client Connections
There are 7 states that a client connection can be in, but we’ll discuss the two that SHOW POOLS
informs us about as they are the most important from an administrative point of view.
Here is a state machine diagram of the different PgBouncer client states:
State machine transitions legend:
A client tries to execute a query; a server connection is available to pair the client with and the query starts executing.
A client tries to execute a query; a server connection is not available, so the client enters a waiting state.
A server connection becomes available; a client is moved to an active state, from which it can then pair with the server.
A client connection has successfully finished logging into PgBouncer.
The query executing on a client connection completes; the client is moved to the connected state (which, in specific PgBouncer terms, means that the link property is cleared) and the server connection becomes available for other clients.
The states work as follows:
CL\_LOGIN
– this is the state for client connections that are currently in the process of logging in to PgBouncer. While there is only one actual PgBouncer state for this, the clients can be in two different sub-states depending on whether they are currently running a query against Postgres – we’ll call them CL\_LOGIN(IN\_PROGRESS)
and CL\_LOGIN(EXECUTING)
.
(Wondering why a client that is still logging in would be executing a query? Well, if the auth\_user property
is configured for the database the client is connecting to, PgBouncer may need to execute a query against pg\_shadow
to execute login checks. Since pg\_shadow
is a Postgres table, a server connection to Postgres is required in order to run queries on it.)
CL\_ACTIVE
– the “active client” state means that there is an established connection from a client to PgBouncer. Some clients in this state are executing queries, but others may not be. For ease of understanding, I have broken this up into a CL\_ACTIVE(EXECUTING)
state and a CL\_ACTIVE(CONNECTED
) state, but this split is not explicit in the PgBouncer codebase.
You can distinguish between clients in these two CL\_ACTIVE
sub-states by running SHOW CLIENTS
. If the link
value is set for a given connection, that means that the client connection is paired with a server connection, i.e. is in CL\_ACTIVE(EXECUTING)
. If the pool\_mode
for the corresponding pool is transaction
or statement
, that means the client is currently executing a transaction (or statement, respectively); if it is session
, then the client will stay connected to a server connection (occupying a Postgres connection) even if it’s not currently executing any queries.
CL\_WAITING
and CL\_WAITING\_LOGIN
– the number displayed in SHOW POOLS
as cl\_waiting
actually combines the counts for clients in these two states. In order for a client connection to enter one of these waiting states, the following sequence of events must happen:
A client has connected to PgBouncer and is in state
CL\_ACTIVE(CONNECTED)
orCL\_LOGIN(IN\_PROGRESS)
(it’s not connected to Postgres yet. That is, it’s not paired with a server connection.)The client tries to execute a query. This requires PgBouncer to find an available server connection and pair it with our client connection. (In terms of states, this means it tries to enter the
CL\_ACTIVE(EXECUTING)
orCL\_LOGIN(EXECUTING)
states – both of which are states we define in this post for clarity only.)PgBouncer fails to find an available server connection. This could happen, for example, if the maximum configured number of connections to Postgres has already been exhausted.
The client connection enters the
CL\_WAITING
(orCL\_WAITING\_LOGIN
, respectively) state.
This is represented by a transition of type 2. in the diagram.
So what happens when a client enters a waiting state? Unsurprisingly, it starts waiting for a server connection to be available. If creating more server connections is allowed (the definition of “allowed” is briefly discussed in the section on Configurable Limits below), PgBouncer will periodically create new ones, and when it does so, it will pick a client that is in a waiting state and activate it. Alternatively, when a client that is currently CL\_EXECUTING
(or CL\_EXECUTING\_LOGIN
) finishes executing a query, the corresponding server connection is freed and a waiting client is activated again.
Here are some conclusions we can draw from the above definitions:
Having a large
cl\_active
value is not a reason for concern by itself. This is because, as discussed above, this represents the number of clients that have successfully logged in to PgBouncer, not the number of clients that are running a query. The only clients running queries are the ones we deemedCL\_EXECUTING
, or with
link
set inSHOW CLIENTS
.Having a large
cl\_waiting
value can be indicative of problems. If the large value persists for a long time, this could mean that there are consistently more clients trying to execute queries in Postgres than we’ve allowed, or that there are long-running queries using up all the server connections.There’s no direct connection between the values for
cl\_active
andcl\_waiting
– they account for non-overlapping states of client connections.
Server Connections
SV\_ACTIVE
– A server connection (that is, one from PgBouncer to Postgres) being active means that it’s established and paired with a client connection. Note that this doesn’t necessarily mean that it’s actively executing a query. In the session pooling mode, the server connection will stay active until the client connected to PgBouncer explicitly disconnects.Unlike active client connections, an
SV\_ACTIVE
server connection is always paired with a client connection and haslink
set.SV\_IDLE
– A connection from PgBouncer to Postgres that is established and ready to serve clients, and is not currently paired with a client connection.SV\_USED
– PgBouncer supports checking connections to Postgres regularly to make sure they are still alive. After a server connection has been idle for longer than**server\_check\_delay**
, it’s moved to this state, indicating that it needs to be checked before using it again.SV\_TESTED
– This state means that the server connection is currently being checked (or will be checked immediately). (Compared to the above,SV\_USED
only indicates that the connection should be checked, whereasSV\_TESTED
indicates that it is being checked right now.)SV\_LOGIN
– A server connection currently trying to connect to Postgres.
Some more thoughts we can get from the above:
cl\_executing + cl\_executing\_login == sv\_active
– if there was a count reported for clients in our made-upCL\_EXECUTING
(andCL\_EXECUTING\_LOGIN
) state, it would be equal to the number of active server connections. This is because, by definition, a client connection in an executing state is paired with a server connection; and there are no other clients paired with a server connection.
Is
sv\_active <= cl\_active
? That is, are we guaranteed that the number of active server connections is less than or equal to the number of active client connections? That’s usually the case, but not always. As mentioned immediately above (and back in the discussion onCL\_WAITING\_LOGIN
), it’s possible for a client connection in theCL\_LOGIN
state to be paired with a server connection and be executing a query. Thus, it’s possible, for example, to have all active clients executing a query, and aCL\_LOGIN
client executing one as well. This will mean thatsv\_active = cl\_active + 1
, making our inequality above not hold in certain cases.
Configurable Limits
PgBouncer comes with a lot of tunables. Here is a brief overview of the most important ones related to connection pool sizing.
pool\_size
– This setting controls how many server connections can be created for a given pool. Note that there can be multiple pools for a given database. A pool is created for each different user connecting to each different database defined in the [databases
] section ofpgbouncer.ini
. (If you specify the user for a given database explicitly in[databases]
, it will have a single pool only.) This means thatsum(sv\_*) <= pool\_size
,where sv\_*
are thesv\_active, sv\_idle
, etc. numbers reported for a given pool, andpool\_size
is the respective pool size limit.max\_db\_connections
– total maximum number of server connections for a given PgBouncer database.max\_user\_connections
– total maximum number of server connections for a given user.max\_client\_conn
– The maximum number of client connections to PgBouncer – that is, the sum ofcl\_\*
for all pools is less than or equal tomax\_client\_con
n. There doesn’t seem to be a setting that allows controlling this at the pool level.
Failure Scenarios
Let’s look at a few different ways PgBouncer might be misbehaving, and how the above definitions could help us figure out what’s going on.
High cl_waiting, high sv_active – Let’s start with a situation where sv\_active
is close to the pool\_size
. As discussed in the section on CL\_WAITING
, there are two ways a client connection can get out of a waiting state – either by PgBouncer creating more server connections to Postgres, or when a connection in the CL\_EXECUTING
(made-up) state finishes executing its query. Since sv\_active
is high (and there are probably other server connections as well), PgBouncer can’t just create more server connections and grant them to clients. Instead, it needs to wait for running queries to complete, and then grant the now free server connections. Therefore, this might be indicative of some of the following problems:
Queries are running for too long, occupying server connections and blocking queued clients from connecting. This could be because of a temporary spike, or someone running a bunch of huge queries at the same time. If it persists for a long time, the queued client connections might start timing out. Possible ways to solve this would be to set a strict query timeout to prevent outliers from taking up all resources, or designing your queries so none of them are significantly slower than the others.
You constantly have more queries coming into PgBouncer than Postgres can handle. If every one of your queries takes 1 second to complete, and you have 120 queries coming in per second with a
pool\_size
of 100, you’ll have a backlog of queries growing at a rate of 20 queries per second. You’ll usually need to tune your queries or Postgres to solve this.
High cl_waiting, low sv_active – I.e., the same scenario as above, except sv\_active
is low compared to pool\_size
. If this happens for a short time period, it might just mean that your load is spiky and PgBouncer can’t create new connections to Postgres fast enough. You might want to tweak min\_pool\_size
to make sure your server connections don’t fall too low even if no clients are executing queries at any given moment.
If this situation is prolonged, there might be something preventing PgBouncer from creating new server connections. To give some examples:
The sum of the
pool\_sizes
for all pools for a given database is larger thanmax\_db\_connections
for the same database. Then, even though there appear to be free slots for a pool, PgBouncer can’t really create more connections.Postgres is configured with
max\_connections
lower than thepool\_size
– or, alternatively, there are clients connecting directly to Postgres that are not accounted for insv\_active
.
High cl_active – This is not, by itself, a sign of problems. PgBouncer should be capable of servicing many thousands of clients being connected at the same time, since as mentioned above, they are not necessarily connected to a server connection.
Code Organization
This is a brief overview of the code structure in case anyone is curious to dive deeper.
include/bouncer.h
– the different connection states are defined here.src/objects.c
– most of the interesting stuff happens here. This is where clients are paired with servers, clients are activated when a server becomes available, and so forth.src/client.c
andsrc/server.c
– there are some client- and server-specific functionalities in these, respectively – for example, handling client authentication, or parsing client / server queries and responses.src/janitor.c
– the code for a number of periodic cleanup tasks: disconnecting idle server connections, rechecking connections, etc.src/takeover.c
– PgBouncer allows you to hot-reload your configuration, without dropping connections. This files contains the logic for doing so.
If you like getting deep into the guts of the tools you use, we are hiring! Or, if you have any questions, feel free to reach out to me on Twitter: @IvanVergiliev.