Saturday, September 21, 2013

Health check parameters

Recently I got some questions about pgpool-II's health check parameters. In this article I will try to explain them.

"Health check" is a term used in pgpool-II. Pgpool-II periodically checks whether PostgreSQL is alive by connecting to it, and we call this the "health check".
There are four parameters that control the behavior of the health check.

health_check_period

This parameter defines the interval between health checks in seconds. If set to 0, the health check is disabled. The default is 0.

health_check_timeout

This parameter sets how long, in seconds, pgpool-II waits before giving up a connection attempt to PostgreSQL. The default is 20. Pgpool-II uses socket system calls such as connect(), read(), write() and close(). These calls can hang if the network connection between pgpool-II and PostgreSQL is broken, and the hang can last until the TCP stack in the kernel gives up, which can take as long as two hours on most operating systems. Obviously this is not good. The solution is to set a timeout before calling those system calls: health_check_timeout. Please note that health_check_timeout must be sufficiently shorter than health_check_period. For example, if health_check_timeout is 20, health_check_period should be 30 or more.
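
The idea of bounding a connection attempt with a timeout can be sketched in Python as follows. This is only an illustration, not pgpool-II's actual code: the function name backend_is_reachable is hypothetical, and it only tests TCP reachability, whereas the real health check connects to PostgreSQL itself.

```python
import socket


def backend_is_reachable(host: str, port: int, timeout_sec: float) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_sec.

    Hypothetical helper for illustration; it is not part of pgpool-II.
    """
    try:
        # create_connection() applies the timeout to the connect attempt, so a
        # silently broken network path cannot block the caller indefinitely.
        with socket.create_connection((host, port), timeout=timeout_sec):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False
```

Without the timeout argument, the connect attempt would be left to the operating system's own TCP limits, which is the situation health_check_timeout is meant to avoid.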

health_check_max_retries

health_check_retry_delay

Sometimes network connections can be temporarily unstable for various reasons. If health_check_max_retries is greater than 0, pgpool-II repeats the health check until it succeeds or until it has retried health_check_max_retries times. The interval between retries is defined by health_check_retry_delay. The default for health_check_max_retries is 0, which disables the retry. The default for health_check_retry_delay is 1 (second).
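
The retry behavior can be sketched roughly like this in Python. It is a simplified model, not pgpool-II's implementation: check_with_retries and its arguments are hypothetical names, and it assumes that one initial attempt is followed by up to max_retries retries, with a sleep of retry_delay_sec between attempts.

```python
import time
from typing import Callable


def check_with_retries(
    check: Callable[[], bool],
    max_retries: int,
    retry_delay_sec: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run check() once, then retry up to max_retries times while it fails."""
    for attempt in range(max_retries + 1):
        if check():
            return True  # the backend answered, stop retrying
        if attempt < max_retries:
            # Wait before the next retry; no sleep after the final failure.
            sleep(retry_delay_sec)
    return False  # every attempt failed; the caller would mark the node down
```

With max_retries set to 0 the loop runs check() exactly once, which matches the default of retries being disabled.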

Please note that "health_check_max_retries * (health_check_timeout+health_check_retry_delay)" should be smaller than health_check_period.

The following settings satisfy the formula: 3 * (10 + 1) = 33, which is smaller than 40.

health_check_period = 40
health_check_timeout = 10
health_check_max_retries = 3
health_check_retry_delay = 1
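
To check a set of values against the formula before putting them into pgpool.conf, a small helper like the one below may be handy. It is only a sketch: the function names are made up for this example, and it implements nothing more than the inequality stated above.

```python
def worst_case_retry_seconds(timeout: int, max_retries: int, retry_delay: int) -> int:
    """Left-hand side of the formula: max_retries * (timeout + retry_delay)."""
    return max_retries * (timeout + retry_delay)


def settings_satisfy_formula(
    period: int, timeout: int, max_retries: int, retry_delay: int
) -> bool:
    """True if the retry budget is strictly smaller than health_check_period."""
    return worst_case_retry_seconds(timeout, max_retries, retry_delay) < period


# The settings from this article: 3 * (10 + 1) = 33 < 40.
print(settings_satisfy_formula(period=40, timeout=10, max_retries=3, retry_delay=1))
```

For the values in this article the helper prints True. Raising health_check_timeout to 20 with the same period would give 3 * (20 + 1) = 63, which no longer fits into 40 seconds.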

Please refer to the pgpool-II documentation for more details.
http://www.pgpool.net/mediawiki/index.php/Documentation#Official_documentation

10 comments:

  1. The Health Check monitor collects QC1 parameters and their scores. It does not aim at completeness but focuses on the critical instrument.

  2. Hi Ishii,

    We have configured pgpool 3.3 for our PostgreSQL database in AWS and are using the health check parameters below.

    health_check_period = 40
    health_check_timeout = 10
    health_check_max_retries = 2
    health_check_retry_delay = 2

    I am trying to test this with an iptables rule on the active server. When I add a DROP rule for the PostgreSQL port, it does trigger a timeout, but I think it times out too soon. According to my understanding, it should wait for the period specified in health_check_timeout, but it actually times out in less than a second (please see the log below). Shouldn't it wait for 10 seconds before timing out, or is my understanding incorrect?

    2014-05-13 12:14:43 DEBUG: pid 15822: health_check: 0 th DB node status: 1
    2014-05-13 12:14:44 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:44 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
    2014-05-13 12:14:44 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
    2014-05-13 12:14:44 DEBUG: pid 15822: health_check: 0 th DB node status: 1
    2014-05-13 12:14:45 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:45 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
    2014-05-13 12:14:45 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
    2014-05-13 12:14:45 ERROR: pid 15822: health check failed. 0 th host 10.0.0.5 at port 5432 is down
    2014-05-13 12:14:45 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:45 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:45 LOG: pid 15822: health check retry sleep time: 2 second(s)
    2014-05-13 12:14:46 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:47 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:47 DEBUG: pid 15822: retrying 1 th health checking
    2014-05-13 12:14:47 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:47 DEBUG: pid 15822: health_check: 0 th DB node status: 1
    2014-05-13 12:14:48 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:48 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
    2014-05-13 12:14:48 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
    2014-05-13 12:14:48 ERROR: pid 15822: health check failed. 0 th host 10.0.0.5 at port 5432 is down
    2014-05-13 12:14:48 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:48 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:48 LOG: pid 15822: health check retry sleep time: 2 second(s)
    2014-05-13 12:14:49 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:50 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:50 DEBUG: pid 15822: retrying 2 th health checking
    2014-05-13 12:14:50 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:50 DEBUG: pid 15822: health_check: 0 th DB node status: 1
    2014-05-13 12:14:51 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
    2014-05-13 12:14:51 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
    2014-05-13 12:14:51 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
    2014-05-13 12:14:51 ERROR: pid 15822: health check failed. 0 th host 10.0.0.5 at port 5432 is down
    2014-05-13 12:14:51 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:51 DEBUG: pid 15822: health check: clearing alarm
    2014-05-13 12:14:51 LOG: pid 15822: set 0 th backend down status
    2014-05-13 12:14:51 DEBUG: pid 15822: failover_handler called
    2014-05-13 12:14:51 DEBUG: pid 15822: failover_handler: starting to select new master node



  3. Hi, we have PostgreSQL 9.3.4 with pgpool 3.3 configured. We are seeing the errors below, and pgpool goes out of sync although streaming replication works fine:

    2014-11-10 02:47:25 ERROR: pid 26332: connect_inet_domain_socket: select() timed out
    2014-11-10 02:47:25 ERROR: pid 26332: make_persistent_db_connection: connection to node0(5432) failed
    2014-11-10 02:47:25 ERROR: pid 26332: health check failed. 0 th host node0 at port 5432 is down

    health_check_timeout is set to 15 seconds, and the system admin confirmed that there is network latency between the two nodes. There are no iptables rules set either.
    Any idea why this issue happens quite frequently, and can anything be done to fix it permanently?

    Thanks Karthick

    Replies
    1. I guess you hit a bug in an earlier version of pgpool-II 3.3. The problem was that a non-blocking connect(2) sometimes takes a long time to establish the connection.
      Recent versions of the 3.3 series work around the problem by simply increasing the internal timeout parameter (which is different from health_check_timeout) to 10 seconds; hopefully that is long enough for such a network.
      Please upgrade to the latest version of the pgpool-II 3.3 series.

      By the way, pgpool-II 3.4 has a new parameter to adjust the internal timeout value.

  4. Thanks for the detailed explanation. I did want to ask, what would happen if the formula wasn't met properly -- i.e. health_check_period == health_check_timeout with a retry of 2 and retry delay of 1?

  5. This comment has been removed by a blog administrator.

  6. Thanks for the information. I will keep this in mind and be more attentive about it.

  7. This comment has been removed by the author.

  8. Hello! Did I understand correctly from your post that pgpool hangs in case of loss of connection to the database?
    This is very important for me to know.
    I would be grateful for your answer!
    Looking forward to your response.
    With best wishes, Nikolay.

  9. > pgpool hangs in case of loss of connection to the database?
    Yes. While pgpool-II's health check is retrying to confirm the connection to PostgreSQL, new connections to pgpool-II are suspended.

