Monday, June 20, 2016

Avoiding connection failure while health checking

Hi pgpool-II users,

I've managed to come back to this blog after 3 years.

Last week pgpool-II developer team released minor versions from 3.1 to 3.5. This release includes a special gift for users: enhancement for health checking.

You might notice an annoying behavior of pgpool-II.

Example: suppose we have three PostgreSQL backends managed by pgpool-II. pgpool-II occasionally checks the health of each backend if "health check" is enabled. If backend #2 goes down, a fail over is triggered and after that, users can use the DB server cluster without backend #2. This is great. However, if you set up the retrying of health checking, clients cannot connect to pgpool-II while it is retrying the health check. For instance,

health_check_max_retries = 10
health_check_retry_delay = 6

will continue the health check retry for 10*6 = 60 seconds at least. This is very annoying for users because they cannot initiate new connections to pgpool-II for 1 minute. Making these parameters shorter might mitigate the situation a little bit but this may not be useful if the network is not stable and longer retries are desirable.

These new releases significantly enhance the situation. By setting:

fail_over_on_backend_error = off

when a user connects to pgpool-II while it is health checking, it starts to connect to all backends including #2. Before this release, pgpool-II gave up initiating session if one of backend is not available (in this case #2). With this release, pgpool-II skips the broken backend and continues to connect to the rest of the backends. Please note, however, this feature is only available when all of the conditions below are met:

  • streaming replication mode
  • the broken backend is not primary server
  • fail_over_on_backend_error  is off

This enhancement is  available in all of the new releases: 3.5.3, 3.4.7, 3.3.11, 3.2.16, 3.1.19.