Playing with PostgreSQL and Pgpool: Health check parameters

Saturday, September 21, 2013

Health check parameters

Recently I got questions on pgpool-II's health check parameters. In this article I will try to explain them.

"Health check" is a pgpool-II term. Pgpool-II periodically checks whether PostgreSQL is alive by connecting to it, and we call this a "health check".
Four parameters control the behavior of the health check.

health_check_period

This parameter defines the interval between health checks in seconds. If set to 0, the health check is disabled. The default is 0.

health_check_timeout

This parameter controls how long, in seconds, pgpool-II waits before giving up on a connection attempt to PostgreSQL. The default is 20. Pgpool-II uses socket system calls such as connect(), read(), write() and close(). These system calls can hang if the network connection between pgpool-II and PostgreSQL is broken, and the hang can last until the TCP stack in the kernel gives up. On most operating systems this can take as long as two hours. Obviously this is not good. The solution is to set a timeout before calling those system calls: health_check_timeout. Please note that health_check_timeout must be sufficiently shorter than health_check_period. For example, if health_check_timeout is 20, health_check_period should be 30 or more.
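To illustrate why a timeout on the socket calls matters, here is a minimal Python sketch. It is not pgpool-II's implementation: it only bounds a plain TCP connect, and the function name backend_is_reachable and its arguments are made up for this example.

```python
import socket


def backend_is_reachable(host: str, port: int, timeout_sec: float) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_sec.

    Hypothetical helper for illustration only; a real health check would
    also have to speak the PostgreSQL protocol after connecting.
    """
    try:
        # The timeout bounds the connect attempt, so a broken network
        # cannot make this call hang until the kernel gives up.
        with socket.create_connection((host, port), timeout=timeout_sec):
            return True
    except OSError:
        # Covers timeouts as well as refused or unreachable connections.
        return False
```

Called as backend_is_reachable("10.0.0.5", 5432, 20), this would return within roughly 20 seconds even if packets to the server are silently dropped.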

health_check_max_retries

health_check_retry_delay

Sometimes network connections can be temporarily unstable for various reasons. If health_check_max_retries is greater than 0, pgpool-II repeats the health check until it succeeds or until it has retried health_check_max_retries times. The interval between retries is defined by health_check_retry_delay. The default for health_check_max_retries is 0, which disables the retry. The default for health_check_retry_delay is 1 (second).
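The retry behavior can be sketched in Python as follows. This is only an illustration, not pgpool-II's code: the names health_check_with_retries, probe and sleep are hypothetical, and it assumes one initial attempt followed by up to max_retries retries.

```python
import time
from typing import Callable


def health_check_with_retries(
    probe: Callable[[], bool],
    max_retries: int,
    retry_delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run probe() once, then retry up to max_retries times on failure.

    Returns True as soon as one attempt succeeds, False if every attempt
    fails. Assumption: with max_retries = 0 there is exactly one attempt.
    """
    retries_done = 0
    while True:
        if probe():
            return True
        if retries_done >= max_retries:
            return False  # the node would now be considered down
        retries_done += 1
        sleep(retry_delay)  # wait before the next attempt
```

The sleep argument is injectable only so that the function can be exercised without actually waiting.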

Please note that "health_check_max_retries * (health_check_timeout+health_check_retry_delay)" should be smaller than health_check_period.

The following settings satisfy the formula.

health_check_period = 40
health_check_timeout = 10
health_check_max_retries = 3
health_check_retry_delay = 1
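As a quick sanity check, the formula above can be written as a small Python function. The function names are made up for this example, and the code only encodes the rule exactly as stated in this article.

```python
def retry_budget(timeout: int, max_retries: int, retry_delay: int) -> int:
    """Left-hand side of the rule: max_retries * (timeout + retry_delay)."""
    return max_retries * (timeout + retry_delay)


def settings_satisfy_formula(
    period: int, timeout: int, max_retries: int, retry_delay: int
) -> bool:
    """True if the retry budget is smaller than health_check_period."""
    return retry_budget(timeout, max_retries, retry_delay) < period


# The settings shown above: 3 * (10 + 1) = 33, which is smaller than 40.
print(retry_budget(10, 3, 1))                  # 33
print(settings_satisfy_formula(40, 10, 3, 1))  # True
```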

Please refer to the pgpool-II documentation for more details.
http://www.pgpool.net/mediawiki/index.php/Documentation#Official_documentation

9 comments:

Unknown, May 14, 2014 at 10:38 AM
Hi Ishii,

We have configured pgpool 3.3 for our PostgreSQL database in AWS and are using the health check parameters below.

health_check_period = 40
health_check_timeout = 10
health_check_max_retries = 2
health_check_retry_delay = 2

I am trying to test this with an iptables rule on the active server. When I add a drop rule for the PostgreSQL port, it does trigger a timeout, but I think it times out too soon. According to my understanding it should wait for the period specified in health_check_timeout, but it actually times out in less than a second (please see the log below). Shouldn't it wait for 10 seconds before timing out, or is my understanding incorrect?

2014-05-13 12:14:43 DEBUG: pid 15822: health_check: 0 th DB node status: 1
2014-05-13 12:14:44 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
2014-05-13 12:14:44 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
2014-05-13 12:14:44 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
2014-05-13 12:14:44 DEBUG: pid 15822: health_check: 0 th DB node status: 1
2014-05-13 12:14:45 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
2014-05-13 12:14:45 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
2014-05-13 12:14:45 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
2014-05-13 12:14:45 ERROR: pid 15822: health check failed. 0 th host 10.0.0.5 at port 5432 is down
2014-05-13 12:14:45 DEBUG: pid 15822: health check: clearing alarm
2014-05-13 12:14:45 DEBUG: pid 15822: health check: clearing alarm
2014-05-13 12:14:45 LOG: pid 15822: health check retry sleep time: 2 second(s)
2014-05-13 12:14:46 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
2014-05-13 12:14:47 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
2014-05-13 12:14:47 DEBUG: pid 15822: retrying 1 th health checking
2014-05-13 12:14:47 DEBUG: pid 15822: health check: clearing alarm
2014-05-13 12:14:47 DEBUG: pid 15822: health_check: 0 th DB node status: 1
2014-05-13 12:14:48 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
2014-05-13 12:14:48 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
2014-05-13 12:14:48 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
2014-05-13 12:14:48 ERROR: pid 15822: health check failed. 0 th host 10.0.0.5 at port 5432 is down
2014-05-13 12:14:48 DEBUG: pid 15822: health check: clearing alarm
2014-05-13 12:14:48 DEBUG: pid 15822: health check: clearing alarm
2014-05-13 12:14:48 LOG: pid 15822: health check retry sleep time: 2 second(s)
2014-05-13 12:14:49 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
2014-05-13 12:14:50 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
2014-05-13 12:14:50 DEBUG: pid 15822: retrying 2 th health checking
2014-05-13 12:14:50 DEBUG: pid 15822: health check: clearing alarm
2014-05-13 12:14:50 DEBUG: pid 15822: health_check: 0 th DB node status: 1
2014-05-13 12:14:51 LOG: pid 15857: connect_inet_domain_socket: select() timed out. retrying...
2014-05-13 12:14:51 ERROR: pid 15822: connect_inet_domain_socket: select() timed out
2014-05-13 12:14:51 ERROR: pid 15822: make_persistent_db_connection: connection to 10.0.0.5(5432) failed
2014-05-13 12:14:51 ERROR: pid 15822: health check failed. 0 th host 10.0.0.5 at port 5432 is down
2014-05-13 12:14:51 DEBUG: pid 15822: health check: clearing alarm
2014-05-13 12:14:51 DEBUG: pid 15822: health check: clearing alarm
2014-05-13 12:14:51 LOG: pid 15822: set 0 th backend down status
2014-05-13 12:14:51 DEBUG: pid 15822: failover_handler called
2014-05-13 12:14:51 DEBUG: pid 15822: failover_handler: starting to select new master node

Karthick, November 10, 2014 at 12:34 PM
Hi, we have PostgreSQL 9.3.4 with pgpool 3.3 configured. We are seeing the errors below, and pgpool goes out of sync even though streaming replication works fine:

2014-11-10 02:47:25 ERROR: pid 26332: connect_inet_domain_socket: select() timed out
2014-11-10 02:47:25 ERROR: pid 26332: make_persistent_db_connection: connection to node0(5432) failed
2014-11-10 02:47:25 ERROR: pid 26332: health check failed. 0 th host node0 at port 5432 is down

health_check_timeout is set to 15 seconds, and the system admin confirmed that there is network latency between the 2 nodes. There are no iptables rules set either.
Any idea why this issue happens quite frequently, and can anything be done to fix it permanently?

Thanks Karthick

Anonymous, November 3, 2015 at 5:44 AM
Thanks for the detailed explanation. I did want to ask, what would happen if the formula wasn't met properly -- i.e. health_check_period == health_check_timeout with a retry of 2 and retry delay of 1?

Brandon, August 3, 2017 at 4:40 PM
Thanks for the information. I will keep this in mind and be more attentive about it.

Nikolay, April 6, 2021 at 8:01 PM
Hello! Did I understand correctly from your post that pgpool hangs in case of loss of connection to the database?
This is very important for me to know.
I would be grateful for your answer!
Looking forward to your response.
With best wishes, Nikolay.

Tatsuo Ishii, April 19, 2021 at 1:45 PM
> pgpool hangs in case of loss of connection to the database?
Yes. While pgpool's health check is retrying to confirm the connection to PostgreSQL, new connections to pgpool will be suspended.