Saturday, July 25, 2020

Snapshot Isolation Mode

Pgpool-II 4.2 is under development

Pgpool-II developers have been working hard on the upcoming Pgpool-II 4.2 release, which is expected around this September or October.

Snapshot isolation mode is coming

For the new release I have just added a new clustering mode called "Snapshot isolation mode", which should be interesting not only for Pgpool-II users but also for PostgreSQL users, so I decided to write about it in this blog.

Existing clustering mode

Up to Pgpool-II 4.1, we have the following clustering modes:
  • Streaming replication mode. This is the most widely used mode at this point. PostgreSQL's streaming replication handles the replication task in this mode.
  • Native replication mode. Until streaming replication became available, this was the most common setting. In this mode Pgpool-II issues the same data modifying statements (DML/DDL) to all the PostgreSQL servers, so the same data is replicated to all of them.
  • Slony mode. In this mode replication is done by Slony-I instead of streaming replication. This was popular until streaming replication appeared.
  • Raw mode. No replication is done in this mode. Rarely used.
The new "Snapshot isolation mode" is pretty close to the existing native replication mode. It does not use the streaming replication at all. Replication is done by sending DML/DDL statements to all the PostgreSQL servers.  If so, what's the difference?

Atomic visibility

Before explaining that I would like to talk about the background a little bit.

PostgreSQL developers have been discussing technologies that can scale to handle large data. One such direction is extending FDW (Foreign Data Wrapper), which sends queries to other PostgreSQL servers to replicate or shard data. The idea is similar to the native replication mode in Pgpool-II because it also sends DML/DDL to multiple servers to replicate data.
As far as replicating data goes, this looks fine. However, there is a problem when multiple sessions try to access the data at the same time.

Suppose we have 2 sessions: session 1 continuously executes UPDATE t1 SET i = i + 1; session 2 continuously executes INSERT INTO log SELECT * FROM t1. Table t1 has only 1 row and its initial value is 0 (each UPDATE and INSERT is actually executed in an explicit transaction).
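A concrete way to picture the two sessions in SQL (the table definitions here are my assumption; the post only describes them informally):

-- setup (assumed): a one-row counter table and a log table
CREATE TABLE t1 (i int);
INSERT INTO t1 VALUES (0);
CREATE TABLE log (i int);

-- session 1, repeated in an explicit transaction
BEGIN;
UPDATE t1 SET i = i + 1;
COMMIT;

-- session 2, repeated in an explicit transaction
BEGIN;
INSERT INTO log SELECT * FROM t1;
COMMIT;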

So after session 1 and session 2 end, the table "log" should contain exactly the same sequence of numbers, like "0, 1, 2, 3....", on both PostgreSQL servers. But see the example below:



Session 1 executed the UPDATE at time t1 on server 1, while on server 2 the UPDATE was executed at t2. Session 2 executed the INSERT at t3 and logged the value 1 on server 1. Unfortunately, on server 2 the INSERT was also executed at t3, when the UPDATE on server 2 had not been committed yet, so the value logged there was 0.

So after the sessions end, we will see different sequences of numbers in the log table on server1 and server2. Without atomic visibility, this kind of data inconsistency can happen. Atomic visibility guarantees that the values seen on server1 and server2 are the same as long as they are read in the same session. In the example above, the value logged into the log table would be either 0 on both servers (t1 has not been updated/committed yet) or 1 on both servers (t1 has already been updated/committed).

Snapshot isolation mode guarantees atomic visibility

The upcoming snapshot isolation mode differs from the existing native replication mode in that it guarantees atomic visibility, so we don't need to worry about the kind of data inconsistency shown in the example above. In fact we have a new regression test for the snapshot isolation mode that uses essentially the same SQL as the example above. Please note that in order to use snapshot isolation mode, the transaction isolation level must be "REPEATABLE READ". (postgres_fdw also runs foreign transactions at this level.)
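If I read the development version correctly, enabling the mode will look roughly like this (parameter names may change before the 4.2 release, so treat this as a sketch):

# pgpool.conf (Pgpool-II 4.2 development version)
backend_clustering_mode = 'snapshot_isolation'

# postgresql.conf on each backend: snapshot isolation mode requires REPEATABLE READ
default_transaction_isolation = 'repeatable read'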

How to achieve atomic visibility?

PostgreSQL developers already recognize the necessity of atomic visibility. Postgres-XC, a fork of PostgreSQL, invented a Global Transaction Manager to achieve atomic visibility. Some PostgreSQL developers are proposing a CSN (Commit Sequence Number) approach. However, these approaches need major surgery in the PostgreSQL code.

Pgpool-II uses a completely different approach called "Pangea", which was proposed in an academic paper. With Pangea, no modification to PostgreSQL is necessary; you can even use older versions of PostgreSQL such as 8.0. So you can enjoy Pgpool-II's snapshot isolation mode today with your favorite version of PostgreSQL.

In Pangea the idea is called Global Snapshot Isolation, which extends the snapshot isolation of a local server across servers. That's why we call the new clustering mode "Snapshot Isolation mode".

How does Pangea work?

With Pangea, the first command in a transaction needs to wait if other transactions are trying to commit. If no commit is ongoing, the command acquires the snapshot which defines the visibility of the transaction. Since local transactions on each server acquire snapshots while there are no committing transactions, they can be regarded as consistent. In the example above, the INSERT command in session 2 is forced to wait until t4, at which point session 1 has committed. Thus the INSERT in session 2 will see "1" in table t1 on both servers. The same applies to commits: a commit must wait if there is an ongoing snapshot acquisition.

Summary

Pgpool-II 4.2 will come with a new clustering mode called "Snapshot Isolation Mode", which guarantees atomic visibility, something not provided by current standard PostgreSQL. Unlike other techniques for implementing atomic visibility, Pgpool-II does not require any modification to PostgreSQL, and it can be used even with older versions of PostgreSQL. Stay tuned.

Wednesday, August 7, 2019

Automatically updating materialized views

Materialized views are a convenient and efficient way to retrieve information from a database. Unlike ordinary views, materialized views save the query result and provide faster access to the data. For example, a user can create a simple materialized view containing the number of rows in a table:

 CREATE MATERIALIZED VIEW mv1 AS SELECT count(*) FROM pgbench_accounts;

Obtaining the number of rows using the materialized view mv1 is much faster than directly counting the rows in pgbench_accounts.

test=# SELECT * FROM mv1;
 count 
--------
 100000
(1 row)

Time: 0.810 ms
test=# SELECT count(*) FROM pgbench_accounts;
 count 
--------
 100000
(1 row)

Time: 13.945 ms


However, if you delete a row from pgbench_accounts, the number of rows reported by mv1 remains unchanged. To reflect changes to the base table (in this case pgbench_accounts), you need to recreate or refresh the view (which actually rebuilds the contents of the materialized view from scratch), and this may take a long time.
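A conventional full refresh uses the standard command below, which recomputes the whole view from scratch:

REFRESH MATERIALIZED VIEW mv1;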

To overcome the problem, SRA OSS is proposing to add a new feature to existing materialized views: "incremental materialized view maintenance". It works like this:

  • add new syntax to allow the materialized view to be incrementally updated when it is created.
  • automatically install triggers attached to the base table(s).
  • when UPDATE/DELETE/INSERT is applied to the base tables, the triggers calculate the difference and update the corresponding rows in the materialized view.
This keeps the materialized view up to date in near real time. The price is the overhead of trigger invocation, so frequently updated tables are not the best fit for incremental materialized view maintenance. Here is a demonstration of how to use an incrementally updated materialized view.

First, create an incrementally updated materialized view.

CREATE INCREMENTAL MATERIALIZED view mv2 AS SELECT count(*) FROM pgbench_accounts;

The view initially says there are 100,000 rows.

SELECT * FROM mv2;
 count 
--------
 100000
(1 row)


Delete a row from pgbench_accounts. Now the number of rows is 99999.

DELETE FROM pgbench_accounts WHERE aid = 10;
DELETE 1
SELECT count(*) FROM pgbench_accounts;
 count
-------
 99999
(1 row)


Make sure that mv2 reports the same number of rows.

SELECT * FROM mv2;
 count
-------
 99999
(1 row)


It works!

This is a very simple materialized view. More complex queries, for example inner joins, also work.

CREATE INCREMENTAL MATERIALIZED VIEW mv3 AS SELECT a.aid, b.bid, t.tid FROM pgbench_accounts a INNER JOIN pgbench_branches b ON (a.bid = b.bid ) INNER JOIN pgbench_tellers t ON (b.bid = t.bid) WHERE a.aid BETWEEN 1 AND 5;
SELECT 50


SELECT * FROM mv3 LIMIT 5;
 aid | bid | tid
-----+-----+-----
   3 |   1 |  10
   2 |   1 |   1
   2 |   1 |   5
   2 |   1 |   4
   5 |   1 |   3

 (5 rows)

DELETE FROM pgbench_accounts WHERE aid = 2;
DELETE 1


SELECT * FROM mv3 LIMIT 5;
 aid | bid | tid
-----+-----+-----
   3 |   1 |  10
   5 |   1 |   3
   1 |   1 |   1
   5 |   1 |   8
   4 |   1 |   3
(5 rows)


Implementing other types of queries, such as outer joins and self joins, is in our plan, possibly targeting PostgreSQL 13.

Here are some pointers to information regarding our incremental materialized view maintenance project:

Wednesday, July 24, 2019

When primary server is far away from standby server

Sometimes we want to create a cluster configuration in which the primary server is physically located far away from the standby servers. For example, the primary server is located in AWS region A, while the standby server is located in AWS region B. This type of configuration is suited for DR (Disaster Recovery) or for a company which has branches in different countries. In the figure below, a company's headquarters is located in region A, and people in the HQ office access the primary database through Pgpool-II.

People in a branch office located in region B access the standby database in a mostly read-only manner. On a few occasions they need to write to the database, and they understand that this requires accessing the HQ database, which is slow. The problem for them is that even for a pure read access, Pgpool-II needs to access the system catalogs on the primary server to obtain meta information about the tables involved in the query.



From Pgpool-II 4.1, which is under development, users can eliminate the slow system catalog access from region B to region A by setting relcache_query_target = load_balance_node and setting the backend weight of the primary server to 0 in the Pgpool-II running in region B.
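A pgpool.conf sketch for the Pgpool-II in region B, assuming the primary in region A is node 0 and the local standby is node 1:

relcache_query_target = 'load_balance_node'   # send relcache queries to the load balance node
backend_weight0 = 0                           # primary in region A: not chosen for load balancing
backend_weight1 = 1                           # local standby in region B: receives read queries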

Now the system catalog accesses are redirected to the standby server rather than the primary server, and access should be faster.

Please note, however, that there may be replication delay, and that could cause trouble if the target table is brand new, because the system catalogs on the standby server may not have the new table's info yet. So this configuration is best suited for a system where table creation is rare, or where tables are created while users are not accessing the system.

By the way, the Pgpool-II developers have set the target date for the Pgpool-II 4.1 release to this October. Please stay tuned.

Tuesday, April 2, 2019

Statement level load balancing

In the previous article I wrote about one of the new features of upcoming Pgpool-II 4.1.
This time I would like to introduce "statement level load balancing" feature of 4.1.

Pgpool-II can distribute read queries among PostgreSQL backend nodes. This allows you to design a scale-out cluster using PostgreSQL. The particular database node used for read queries is determined at the session level, i.e. when a client connects to Pgpool-II. This is the so-called "session level load balancing".

For example, if a client connects to Pgpool-II and the load balance node is node 1 (we assume this is a streaming replication standby node), then any read query will be sent to the load balance node (in this case node 1, the standby node). A different session may choose the master (primary) as its load balance node. The ratio of load balance node selection across sessions is determined by the "backend weight" parameters in the Pgpool-II configuration file (usually named "pgpool.conf"), typically "backend_weight0" and "backend_weight1", corresponding to node 0 and node 1 respectively.

This is fine as long as clients connect to Pgpool-II, issue some queries, and disconnect, since the next time a client connects to Pgpool-II a different load balance node may be chosen according to the backend weight parameters.

However, if your client already has a connection pooling feature, session level load balancing can be a problem, since the load balance node is selected only once, when the pooled connection from the client to Pgpool-II is created.

The statement level load balancing feature was created to solve this problem. Unlike session level load balancing, the load balance node is determined each time a new query is issued. The new parameter for this is "statement_level_load_balance". If it is set to on, the feature is enabled (the parameter can be changed by reloading pgpool.conf).
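A pgpool.conf sketch to enable it (the weights are just an example matching the output below):

statement_level_load_balance = on
backend_weight0 = 0.5
backend_weight1 = 0.5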

At first "select_cnt" is 0, which means no SELECTs were issued.

test=# show pool_nodes;

 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change 
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 0          | true              | 0                 | 2019-04-02 15:36:58
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 0          | false             | 0                 | 2019-04-02 15:36:58
(2 rows)

Let's issue a SELECT.

test=# select 1;
?column? 
----------
        1
(1 row)
test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change  
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 1          | true              | 0                 | 2019-04-02 15:36:58
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 0          | false             | 0                 | 2019-04-02 15:36:58
(2 rows)

Now the select_cnt of node 0 is 1, which means the SELECT was sent to node 0. Also please note that the "load_balance_node" column of node 0 is "true", which means node 0 was chosen as the load balance node for the last query.

Ok, let's issue another SELECT:

test=# select 2;
 ?column? 
----------
        2
(1 row)

test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change  
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 1          | false             | 0                 | 2019-04-02 15:36:58
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 1          | true              | 0                 | 2019-04-02 15:36:58
(2 rows)

Now the load_balance_node is changed to node 1, and the select_cnt of node 1 becomes 1. This is how the statement level load balancing works.

Sunday, March 24, 2019

Shared Relation Cache

 System catalogs?

Pgpool-II needs to access PostgreSQL's system catalogs whenever it recognizes tables in a user's query. For example, Pgpool-II has to know whether the table in question is a temporary table or not. If it is, the query using it must be routed to the primary PostgreSQL server rather than one of the standby servers, because PostgreSQL does not allow temporary tables to be created on standby servers. Another use case is converting a table name to an OID (Object Identifier). OIDs are unique keys for objects managed in PostgreSQL's system catalogs.

The same can be said of functions. Function details, for instance whether a function is "immutable" or not, are important information since they affect the decision on whether the result of a query using the function should be cached when the query cache feature is enabled.

Local query cache for system catalogs

Sometimes Pgpool-II needs to issue as many as 10 queries to the system catalogs when it sees a table or function for the first time. Fortunately, Pgpool-II does not waste the query results. They are stored in a local cache (we call it the "relation cache" or "relcache"), and the next time it sees the object in the same or a different query, it extracts the info from the local cache. So far so good.

The problem is that the local cache is stored in the private memory of each Pgpool-II child process, and each new client session is assigned to a different child process. So even if only a single table is used in queries, Pgpool-II keeps accessing the system catalogs until the table's info has been filled in all the local caches.

Shared relation cache

How can we mitigate the problem? One solution is to share the relation cache info among Pgpool-II processes. This way, once one process has accessed the system catalogs and obtained the info, the other processes no longer need to access the system catalogs. The cache shared by the processes is called the "shared relation cache".

How to implement it?

But how do we implement it? Fortunately Pgpool-II already has a shared query cache. Why not store the shared relation cache in it? Here's the idea:
  • If the table/function info is not in the local relation cache, check the shared relation cache.
  • If it is not in the shared relation cache, access the system catalogs and store the info in the local cache. Also copy the info to the shared relation cache.
  • If the table/function info is already in the shared relation cache, just copy the info to the local cache.
You might wonder why there are two kinds of cache: a local one and a shared one. The reason is locking. Since the local cache is never accessed by multiple processes, it does not need any locking. The shared relation cache, on the other hand, can be accessed by multiple processes, so it must be guarded by locking, and that could become a serious bottleneck if there are many processes.

 Cache invalidation

Any cache needs to be invalidated someday. In the current implementation, cache invalidation is based on a timeout. The timeout value can be specified using the "relcache_expire" parameter, which also controls the local cache timeout.

Is it faster?

Is the shared relation cache faster? Well, it depends on the use case. If there are only a small number of tables or functions, the overhead of the shared relation cache gives no advantage. However, if there are many tables/functions, it definitely wins. That is why Pgpool-II has a switch (enable_shared_relcache) to enable or disable the feature.
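A pgpool.conf sketch to turn the feature on (relcache_expire also applies here; the value shown is just the default):

enable_shared_relcache = on   # share the relation cache among child processes
relcache_expire = 0           # lifetime in seconds; 0 means no timeout-based invalidation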

To demonstrate a case where the feature wins, I created 1, 2, 4, 8, 16, 32, 64, 128 and 256 tables (all empty) and accessed them using pgbench. The pgbench options are as follows:

pgbench  -C -T 30 -c 30 -n -f script_file

pgbench was run 3 times for each case and I used the average of the numbers.
The script file includes 1 to 128 SELECTs, one to access each table.
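For example, for the 4-table case the custom script file would look roughly like this (the table names are hypothetical):

SELECT * FROM t1;
SELECT * FROM t2;
SELECT * FROM t3;
SELECT * FROM t4;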

The blue line (Speed = 1) is the baseline, i.e. the feature disabled. The red line is the feature enabled. As you can see, as the number of tables increases, the performance gain increases as well, up to 32 tables. Beyond that the gain gets smaller, but the performance with the shared relcache on is still superior to having it off.

The results may differ according to the workload: if the SELECTs are heavy, the effect may be weaker because the longer SELECT execution time hides the effect of the shared relcache.

When will it be available?

The feature is already committed into Pgpool-II 4.1, which is expected to be released around September 2019. So stay tuned!

Tuesday, December 4, 2018

log_client_messages in Pgpool-II 4.0

Pgpool-II 4.0 adds a new logging feature called "log_client_messages". This allows messages coming from the frontend to be logged. Up to 3.7, the only way to log frontend messages was to enable the debug log, which produced a tremendous amount of output.
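Enabling it is a single pgpool.conf setting (sketch):

log_client_messages = on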

For example, with log_client_messages enabled, "pgbench -S -M prepared -t 2" produces the frontend logs below:

2018-12-04 16:43:45: pid 6522: LOG:  Parse message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  statement: "P0_1", query: "SELECT abalance FROM pgbench_accounts WHERE aid = $1;"
2018-12-04 16:43:45: pid 6522: LOG:  Sync message from frontend.
2018-12-04 16:43:45: pid 6522: LOG:  Bind message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: "", statement: "P0_1"
2018-12-04 16:43:45: pid 6522: LOG:  Describe message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: ""
2018-12-04 16:43:45: pid 6522: LOG:  Execute message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: ""
2018-12-04 16:43:45: pid 6522: LOG:  Sync message from frontend.


As you can see, pgbench parses the query "SELECT abalance FROM pgbench_accounts WHERE aid = $1;" as prepared statement "P0_1", then sends a Bind message to bind a parameter value to "$1".
It then sends a Describe message to obtain metadata, and finally sends an Execute message to run the query.

Below is the second execution of the query (remember that we added the "-t 2" parameter to execute 2 transactions).

2018-12-04 16:43:45: pid 6522: LOG:  Bind message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: "", statement: "P0_1"
2018-12-04 16:43:45: pid 6522: LOG:  Describe message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: ""
2018-12-04 16:43:45: pid 6522: LOG:  Execute message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: ""
2018-12-04 16:43:45: pid 6522: LOG:  Sync message from frontend.
2018-12-04 16:43:45: pid 6522: LOG:  Terminate message from frontend.


This time no Parse message is sent because pgbench reuses the named statement "P0_1", which eliminates the parse/analysis step. That is why pgbench runs very fast in this mode compared with the other modes.

In summary, log_client_messages is useful when you want detailed information on what the client is doing.

Friday, October 19, 2018

Pgpool-II 4.0 released! (SCRAM authentication)

After 1 year of hard work by Pgpool-II developers,  Pgpool-II 4.0 is finally out.
This version celebrates the 15th anniversary of Pgpool-II, and introduces rich new features. I will explain them one by one in a series of blogs.

Support for SCRAM authentication

SCRAM authentication is a modern, secure authentication method supported by PostgreSQL 10 or later. Pgpool-II 4.0 supports the method by providing multiple new functionalities.

  • pool_hba.conf now accepts the "scram-sha-256" authentication method.
  • If this method is specified, users connecting to Pgpool-II are required to respond according to the SCRAM authentication protocol.
  • Passwords stored in pool_passwd can be used to verify a user's password before a session starts.
  • Passwords can be stored in pool_passwd in multiple formats: clear text or AES256 encrypted (md5-hashed passwords can also be stored in pool_passwd, but SCRAM cannot use them).
  • AES256 encryption is preferred; the key to decrypt it is stored in .pgpoolkey under the home directory of the user who started Pgpool-II, or in a file specified by the PGPOOLKEYFILE environment variable.
  • For AES256 encryption, the new tool pg_enc can be used, as sketched below.
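As a rough sketch of the pieces involved (the host range and user name are placeholders, and the exact pg_enc options are best checked in the manual):

# pool_hba.conf: require SCRAM from clients connecting over TCP
host    all    all    0.0.0.0/0    scram-sha-256

# register an AES256-encrypted password in pool_passwd, using the key in ~/.pgpoolkey
# (-m updates pool_passwd, -p prompts for the password)
pg_enc -m -k ~/.pgpoolkey -u someuser -p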
Of course, Pgpool-II itself can authenticate with SCRAM when connecting to PostgreSQL, if PostgreSQL requires SCRAM authentication from Pgpool-II.

A good thing about Pgpool-II 4.0 is that it allows different authentication methods between users and Pgpool-II and between Pgpool-II and PostgreSQL: for example, SCRAM between users and Pgpool-II, and md5 between Pgpool-II and PostgreSQL.

SCRAM authentication can also be used for the health check and the replication delay check. Before 4.0, the user name and password for these were stored in clear text in pgpool.conf, which is not the most secure way. Pgpool-II 4.0 allows AES256-encrypted passwords to be stored in pgpool.conf; .pgpoolkey is also used for decrypting these passwords.

Next time I will explain other authentication features in Pgpool-II 4.0.

Friday, June 15, 2018

Even more fine load balancing control with Pgpool-II 4.0

A while back, I posted an article titled "More load balancing fine control" to explain a new feature of the upcoming Pgpool-II 4.0. Today I would like to talk about yet another new load balancing feature in Pgpool-II.

Pgpool-II can already control read query load balancing at several granularities (see the documentation for more details):
  •  Whether to enable load balancing or not (load_balance_mode)
  • By database (database_redirect_preference_list)
  • By application name (app_name_redirect_preference_list)
  • By functions used in the query (white_function_list, black_function_list)
  • Statement level (/*NO LOAD BALANCE*/ comment)
The last method allows statement level control of load balancing, but it requires rewriting the query, which is often impossible if you are using commercial software.

A new configuration parameter called "black_query_pattern_list" allows you to disable load balancing for queries specified by the parameter. The parameter takes a regular expression string. If a query matches the expression, load balancing is disabled for that query: it is sent to the primary (master) PostgreSQL server.

Here are some examples.

We have:
black_query_pattern_list = 'SELECT \* FROM t1\;;'
in pgpool.conf. Note that some special characters must be escaped with a '\' (backslash) character.

No SELECTs have been issued to either the primary or the standby yet.

test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change 
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 0          | false             | 0                 | 2018-06-15 11:57:25
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 0          | true              | 0                 | 2018-06-15 11:57:25
(2 rows)


If the following query is issued, the "select_cnt" column of the standby would normally be incremented, since the standby is the load balance node (notice the "role" column).

test=# SELECT * FROM t1;

But because of black_query_pattern_list, the SELECT is redirected to the primary.

test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change 
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 1          | false             | 0                 | 2018-06-15 11:57:25
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 0          | true              | 0                 | 2018-06-15 11:57:25
(2 rows)



However, "SELECT * FROM t1 WHERE i = 0" will be sent to standby since the expression specied in black_query_pattern_list does not match with the query.

test=# select * from t1 where i = 1;
 i
---
(0 rows)


test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change 
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 1          | false             | 0                 | 2018-06-15 11:57:25
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 1          | true              | 0                 | 2018-06-15 11:57:25
(2 rows)


If you want to match any query using table t2, you can specify something like:

  .* t2;;.* t2 .*;

The first part ".* t2;" matches any query without any qualification (example: "SELECT * FROM t2"), while latter matches any query having space after t2 (example: "SELECT * FROM t2 WHERE i = 1).

Please note that Pgpool-II does not do any syntax analysis on the query here, so be careful when specifying a regular expression. For instance, "SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2)" will not match any of the regular expression patterns above, so the query will be sent to the standby. This may or may not be what you expect.

By the way, you may have already noticed the extra "last_status_change" column in the "show pool_nodes" output. It indicates the timestamp when either "status" or "role" last changed. This is useful when you want to find the log lines from the last time a failover happened.

Friday, May 18, 2018

statement_timeout changed in PostgreSQL 11

PostgreSQL allows you to set a time limit for each SQL statement. For example,

SET statement_timeout to '4s';

will set the upper limit of execution time for each subsequent query to 4 seconds. So,

SELECT pg_sleep(5);

will be canceled after 4 seconds:

ERROR:  canceling statement due to statement timeout

Simple and pretty easy to understand.

But if it is applied to extended queries, things are not so simple anymore. Each extended query is divided into multiple phases:

  1. Parse: parse the SQL statement
  2. Bind: bind parameters to the parsed SQL statement
  3. Execute: run the query
  4. Sync: ask PostgreSQL to return the query results
So when does the statement timeout fire? The answer is 4, not 3. Until the Sync message is received, the statement timeout is not checked for extended queries.

OK. Consider the next example. We assume that statement_timeout has been set to 4 seconds.
  1. Parse: SELECT pg_sleep(2) (Query A)
  2. Bind: bind parameters to the parsed SQL statement A
  3.  Execute: run the query A
  4. Parse: SELECT pg_sleep(3) (Query B)
  5. Bind B
  6.  Execute B
  7.  Sync
This time, the statement timeout fires even though queries A and B each finish within 4 seconds, because the timer keeps running until the Sync message, which arrives more than 4 seconds after the first Parse. This is not very intuitive behavior, I think. Moreover, if duration logging is enabled, the durations for queries A and B will of course be 2 seconds and 3 seconds. So users will be confused: each query definitely finishes within 4 seconds, yet the statement timer fires.

From PostgreSQL 11, this behavior changes. The statement timeout is checked when the Execute message is processed (steps 3 and 6 above). So the statement timer will not fire in the example above.

In summary, statement_timeout in PostgreSQL 11 behaves more intuitively than in previous releases.

Tuesday, April 24, 2018

More load balancing fine control

Pgpool-II is known for providing load balancing of read queries across multiple PostgreSQL servers.

Consider a cluster consisting of Pgpool-II and two PostgreSQL servers configured with streaming replication. One is a primary server, the other is a standby server. Also the load balancing weight is 0:1, which means read queries will be basically sent to the standby server.

In an explicit transaction, if a read query is sent from a client, Pgpool-II sends it to the standby server (as long as the replication delay is not too large). Next, suppose a write query comes; of course it is sent to the primary server. Then something interesting happens if another read query is sent: unlike the previous read query, it is sent to the primary to avoid problems from replication delay.


Pgpool-II 4.0 will have a new configuration parameter called "disable_load_balance_on_write" to give finer control over the behavior above.

If disable_load_balance_on_write = 'transaction', the behavior is exactly the same as above. This is the default.

If disable_load_balance_on_write = 'off', Pgpool-II no longer takes the write query into account and load balances read queries as much as possible.

This choice is good if you want to get the maximum benefit from load balancing.

A variant of 'transaction' is 'trans_transaction'. Unlike 'transaction', the effect of disabling load balancing after a write query continues even after the transaction closes and the next transaction begins.


So 'trans_transaction' is good for those who value data consistency over load balancing. Please note that read queries issued between explicit transactions are not affected by the parameter.

Finally there's 'always'. With this mode, once a write query is sent, any subsequent read query, whether it is in an explicit transaction or not, will not be load balanced and will be sent to the primary server until the session ends.

In summary, the degree of data consistency protection is, in decreasing order: always, trans_transaction, transaction, off. In contrast, the degree of performance is, in decreasing order: off, transaction, trans_transaction, always. If you are not sure which one is the best choice for you, I recommend starting with 'transaction' because it strikes a good balance between data consistency and performance.
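In pgpool.conf the recommended starting point looks like this (sketch):

disable_load_balance_on_write = 'transaction'   # the default; stop load balancing reads after a write within the same transaction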

Wednesday, April 18, 2018

Detecting "false" primary server of PostgreSQL

One of the upcoming Pgpool-II 4.0's interesting features is "false" primary server detection in streaming replication environments.

Suppose we have one primary server and two standby servers connected to the primary server.

test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | /tmp     | 11002 | up     | 0.000000  | primary | 1          | false             | 0
 1       | /tmp     | 11003 | up     | 0.000000  | standby | 0          | false             | 0
 2       | /tmp     | 11004 | up     | 1.000000  | standby | 0          | true              | 0
(3 rows)


What will happen if the node 2 standby server is promoted to primary?

test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | /tmp     | 11002 | up     | 0.000000  | primary | 1          | false             | 0
 1       | /tmp     | 11003 | up     | 0.000000  | standby | 0          | false             | 0
 2       | /tmp     | 11004 | up     | 1.000000  | standby | 0          | true              | 0
(3 rows)


As you can see, nothing has changed as far as the show pool_nodes command is concerned.
But actually node 2 is no longer a standby connected to the primary. So if large updates are sent to the primary, node 2 falls far behind the primary server, since data is no longer replicated to it.

t-ishii@localhost: pgbench -i -p 11000 test
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.19 s, remaining 0.00 s)
vacuum...
set primary keys...
done.


 test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | /tmp     | 11002 | up     | 0.000000  | primary | 1          | false             | 0
 1       | /tmp     | 11003 | up     | 0.000000  | standby | 0          | false             | 0
 2       | /tmp     | 11004 | up     | 1.000000  | standby | 0          | true              | 13100296
(3 rows)


How can we detect the situation and fix it?

Pgpool-II 4.0 will help you. It will have a new parameter called "detach_false_primary". If the parameter is enabled, Pgpool-II will automatically detect the situation and detach node 2 because it is a "false" primary.
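A pgpool.conf sketch of the setting:

detach_false_primary = on   # automatically detach a node that is wrongly acting as primary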

 test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | /tmp     | 11002 | up     | 0.000000  | primary | 1          | false             | 0
 1       | /tmp     | 11003 | up     | 0.000000  | standby | 0          | false             | 0
 2       | /tmp     | 11004 | down   | 1.000000  | standby | 0          | true              | 13100296
(3 rows)


Note that the feature has a limitation. For example, suppose there are 4 nodes: primary 1 is connected to standby 1, while primary 2 is connected to standby 2. There is no reliable way to decide which node is the "true" primary, so in this case false primary detection does not work.

The figure below explains what Pgpool-II 4.0 can and cannot do.
You can read more details about the feature in the manual.

