Saturday, September 21, 2013

Health check parameters

Recently I received some questions about pgpool-II's health check parameters. In this article I will try to explain them.

"Health check" is a term used in pgpool-II. Pgpool-II periodically checks whether PostgreSQL is alive by connecting to it; we call this the "health check".
Four parameters control the behavior of the health check.

health_check_period

This parameter defines the interval between health checks in seconds. If set to 0, the health check is disabled. The default is 0.

health_check_timeout

This parameter sets the timeout, in seconds, before pgpool-II gives up on an attempt to connect to PostgreSQL. The default is 20. Pgpool-II uses socket system calls such as connect(), read(), write() and close(). These calls can hang if the network connection between pgpool-II and PostgreSQL is broken, and the hang can last until the TCP stack in the kernel gives up, which could be as long as two hours on most operating systems. Obviously this is not acceptable. The solution is to set a timeout before making those system calls: health_check_timeout. Please note that health_check_timeout must be sufficiently shorter than health_check_period. For example, if health_check_timeout is 20, health_check_period should be 30 or more.
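The idea of bounding a connection attempt with a timeout can be sketched in Python. This is only an illustration of the technique, not pgpool-II's actual implementation (pgpool-II is written in C); the function name is_alive and its arguments are hypothetical.

```python
import socket

def is_alive(host, port, timeout):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection() gives up after `timeout` seconds instead of
        # waiting for the kernel's TCP stack to give up on its own.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts alike.
        return False
```

Note that this sketch only tests whether a TCP connection can be opened; it does not speak the PostgreSQL protocol.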

health_check_max_retries

health_check_retry_delay

Sometimes network connections can be temporarily unstable for various reasons. If health_check_max_retries is greater than 0, pgpool-II repeats the health check until it succeeds or until it has retried health_check_max_retries times. The interval between retries is defined by health_check_retry_delay. The default for health_check_max_retries is 0, which disables retries. The default for health_check_retry_delay is 1 (second).
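The retry behavior described above might look roughly like the following Python sketch. It assumes that the initial check is not counted as a retry, so a node is checked at most max_retries + 1 times; pgpool-II's real logic may differ in detail. The name check_with_retries and the check callback are hypothetical.

```python
import time

def check_with_retries(check, max_retries, retry_delay, sleep=time.sleep):
    """Run check() once, then retry up to max_retries times until it succeeds.

    Returns (succeeded, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        if check():
            return True, attempts
        if attempts > max_retries:
            # The initial attempt and all retries failed: treat the node as down.
            return False, attempts
        sleep(retry_delay)  # wait before the next retry
```

With max_retries set to 0, the first failure is final, which corresponds to retries being disabled.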

Please note that "health_check_max_retries * (health_check_timeout+health_check_retry_delay)" should be smaller than health_check_period.

The following settings satisfy the formula.

health_check_period = 40
health_check_timeout = 10
health_check_max_retries = 3
health_check_retry_delay = 1
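The arithmetic for these values can be checked with a few lines of Python. The helper name retries_fit_in_period is hypothetical and simply evaluates the formula given above.

```python
def retries_fit_in_period(period, timeout, max_retries, retry_delay):
    """Evaluate max_retries * (timeout + retry_delay) < period.

    Returns (fits, worst_case_seconds).
    """
    worst_case = max_retries * (timeout + retry_delay)
    return worst_case < period, worst_case

# Values from the example settings: 3 * (10 + 1) = 33, which is below 40.
print(retries_fit_in_period(40, 10, 3, 1))  # prints (True, 33)
```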

Please refer to the pgpool-II documentation for more details.
http://www.pgpool.net/mediawiki/index.php/Documentation#Official_documentation

3 comments:

  3. Hi Ishii,

    We have configured pgpool 3.3 for our PostgreSQL database in AWS, and we are using the health check parameters below.

    health_check_period = 40
    health_check_timeout = 10
    health_check_max_retries = 2
    health_check_retry_delay = 2

    I am trying to test this by using an iptables rule on the active server. When I add a drop rule for the PostgreSQL port, it does trigger a timeout, but I think it times out too soon. According to my understanding, it should wait for the period specified in "health_check_timeout" in this case, but it actually times out in about a second (please see the log below). Shouldn't it wait for 10 seconds before timing out, or is my understanding incorrect?

    2014-05-13 12:14:43 DEBUG: pid 15822: health_check: 0 th DB node status: 1
    2014-05-13 12:14:44 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:44 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
    2014-05-13 12:14:44 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
    2014-05-13 12:14:44 DEBUG: pid 15822: health_check: 0 th DB node status: 1
    2014-05-13 12:14:45 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:45 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
    2014-05-13 12:14:45 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
    2014-05-13 12:14:45 ERROR: pid 15822: health check failed. 0 th host 10.0.0.5 at port 5432 is down
    2014-05-13 12:14:45 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:45 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:45 LOG: pid 15822: health check retry sleep time: 2 second(s)
    2014-05-13 12:14:46 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:47 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:47 DEBUG: pid 15822: retrying 1 th health checking
    2014-05-13 12:14:47 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:47 DEBUG: pid 15822: health_check: 0 th DB node status: 1
    2014-05-13 12:14:48 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:48 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
    2014-05-13 12:14:48 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
    2014-05-13 12:14:48 ERROR: pid 15822: health check failed. 0 th host 10.0.0.5 at port 5432 is down
    2014-05-13 12:14:48 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:48 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:48 LOG: pid 15822: health check retry sleep time: 2 second(s)
    2014-05-13 12:14:49 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:50 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:50 DEBUG: pid 15822: retrying 2 th health checking
    2014-05-13 12:14:50 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:50 DEBUG: pid 15822: health_check: 0 th DB node status: 1
    2014-05-13 12:14:51 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:51 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
    2014-05-13 12:14:51 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
    2014-05-13 12:14:51 ERROR: pid 15822: health check failed. 0 th host 10.0.0.5 at port 5432 is down
    2014-05-13 12:14:51 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:51 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:51 LOG: pid 15822: set 0 th backend down status
    2014-05-13 12:14:51 DEBUG: pid 15822: failover_handler called
    2014-05-13 12:14:51 DEBUG: pid 15822: failover_handler: starting to select new master node


