Playing with PostgreSQL and Pgpool: Visibility with query cache

Monday, April 19, 2021

Visibility with query cache

Pgpool-II provides a "query cache" feature that speeds up SELECT queries against the database. It stores query results in a specified storage (either shared memory or memcached). If the same query (more precisely, a SELECT query with identical text) arrives, Pgpool-II returns the cached result without accessing the database. This is very fast because:

Query results are fetched from memory rather than from the database's (typically on-disk) storage
It does not use CPU or memory on the database server
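The idea of "same query text means cache hit" can be sketched with a toy model. This is not Pgpool-II's implementation; the class name ToyQueryCache and the fake_backend function are made up for illustration, and the cache is just a Python dict keyed by the exact SQL string.

```python
class ToyQueryCache:
    """Toy stand-in for a result cache keyed by the exact SELECT text."""

    def __init__(self, backend):
        self.backend = backend   # callable: SQL text -> result rows
        self.entries = {}        # SQL text -> cached rows
        self.hits = 0
        self.misses = 0

    def select(self, sql):
        if sql in self.entries:
            # Identical text: answer from memory, the backend is not touched.
            self.hits += 1
            return self.entries[sql]
        # Unknown text: ask the backend and remember the answer.
        self.misses += 1
        rows = self.backend(sql)
        self.entries[sql] = rows
        return rows


backend_calls = []

def fake_backend(sql):
    """Hypothetical database: records each call and returns a dummy row."""
    backend_calls.append(sql)
    return [("result of", sql)]


cache = ToyQueryCache(fake_backend)
cache.select("SELECT * FROM t1")   # miss: goes to the backend
cache.select("SELECT * FROM t1")   # hit: identical text
cache.select("select * from t1")   # miss: the text differs, so the key differs
print(cache.hits, cache.misses, len(backend_calls))
```

In this sketch the third query returns the same rows logically, but because the lookup is by text, it is treated as a different query.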

Is there any pitfall?

Whether a SELECT result is fetched from the database or from the cache is transparent to the client. That is, you cannot tell the difference except by the response speed. OK, sounds great. But are there any pitfalls in using the query cache? One is the cache hit rate. Since any modification to an underlying table clears the cache, a system that involves frequent updates is not well suited to the feature. According to the manual, the query cache is not recommended for a system whose cache hit rate is lower than 70%.
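To see why frequent updates hurt the hit rate, here is a small simulation. It is only a sketch under simplifying assumptions: a single table, a single query text, and a hypothetical InvalidatingCache class that drops everything cached for a table whenever that table is modified.

```python
class InvalidatingCache:
    """Toy cache that forgets a table's results whenever the table is modified."""

    def __init__(self):
        self.entries = {}   # (table, SQL text) -> rows
        self.hits = 0
        self.misses = 0

    def select(self, table, sql):
        key = (table, sql)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = ["rows"]   # pretend we fetched from the database
        return self.entries[key]

    def modify(self, table):
        # Any write to the table invalidates every cached result for it.
        self.entries = {k: v for k, v in self.entries.items() if k[0] != table}

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def simulate(reads_per_write, rounds=100):
    """Each round: one write to t1, then reads_per_write identical SELECTs."""
    cache = InvalidatingCache()
    for _ in range(rounds):
        cache.modify("t1")
        for _ in range(reads_per_write):
            cache.select("t1", "SELECT * FROM t1")
    return cache.hit_ratio()


read_heavy = simulate(reads_per_write=10)   # 9 hits out of every 10 reads
write_heavy = simulate(reads_per_write=2)   # 1 hit out of every 2 reads
print(read_heavy, write_heavy)
```

With ten reads per write the toy workload stays above the 70% mark mentioned above; with two reads per write it falls well below it.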

Row visibility rules

Another pitfall you may need to care about is the row visibility rules. These rules define how table rows changed by one session become visible to other sessions (so you don't need to worry about this if you are the only user of the database). PostgreSQL provides several transaction isolation levels, and the row visibility rules vary depending on the isolation level used in the transaction. For example, with the read committed isolation level (which is the default), other users will not see the new data of rows updated by your transaction until it gets committed.

However, with the repeatable read isolation level, other users whose transactions are already in progress will not see the new data of rows updated by your transaction even after your transaction gets committed. The new data becomes visible to them only after they start a new transaction.
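The difference between the two isolation levels can be illustrated with a toy snapshot model. This is a heavy simplification of what PostgreSQL does, written only to show the rule: the ToyDatabase and ToyTransaction classes are hypothetical, there is a single value instead of rows, and a "snapshot" is just a commit sequence number. Read committed takes a fresh snapshot for every statement, while repeatable read keeps the snapshot taken by its first statement.

```python
class ToyDatabase:
    """Keeps every committed version of a single value, oldest first."""

    def __init__(self, value):
        self.versions = [(0, value)]   # (commit sequence number, value)

    def latest_seq(self):
        return self.versions[-1][0]

    def commit_update(self, value):
        self.versions.append((self.latest_seq() + 1, value))

    def read_as_of(self, seq):
        # Newest version committed at or before the given snapshot.
        visible = [v for s, v in self.versions if s <= seq]
        return visible[-1]


class ToyTransaction:
    def __init__(self, db, isolation):
        self.db = db
        self.isolation = isolation
        self.snapshot = None

    def select(self):
        # Read committed: new snapshot per statement.
        # Repeatable read: snapshot fixed by the first statement.
        if self.isolation == "read committed" or self.snapshot is None:
            self.snapshot = self.db.latest_seq()
        return self.db.read_as_of(self.snapshot)


db = ToyDatabase("old")
rc = ToyTransaction(db, "read committed")
rr = ToyTransaction(db, "repeatable read")
before = (rc.select(), rr.select())   # both transactions see "old"

db.commit_update("new")               # your transaction commits its UPDATE

after = (rc.select(), rr.select())    # rc sees "new", rr still sees "old"
fresh = ToyTransaction(db, "repeatable read").select()   # new transaction sees "new"
print(before, after, fresh)
```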

Query cache could break the visibility rule

Pgpool-II's query cache does not follow the visibility rules. A cache entry is created when a transaction gets committed. The difference is that once a cache entry is created, it can be seen by any transaction. This is fine with the read committed isolation level. But with the repeatable read isolation level, it breaks the visibility rules.

Suppose the query cache feature is used with the repeatable read transaction isolation level. This time, after the UPDATE, a SELECT is issued to create a cache entry.

Unlike the previous figure, the other user's transaction will see the new data once your transaction gets committed, because the cache entry holding the new data is registered when your transaction gets committed and can then be seen by anyone. This can happen not only with the repeatable read transaction isolation level but also with the serializable transaction isolation level.
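The scenario can be replayed with a toy model that puts a shared cache in front of a snapshot-based database. Everything here is an illustrative assumption rather than Pgpool-II's actual code: the ToyDatabase and RepeatableReadSession classes are hypothetical, the shared cache is a plain dict, and cache entries created inside a transaction are held in a "pending" area and published at commit, mirroring the description above.

```python
class ToyDatabase:
    """Keeps every committed version of a single value, oldest first."""

    def __init__(self, value):
        self.versions = [(0, value)]   # (commit sequence number, value)

    def latest_seq(self):
        return self.versions[-1][0]

    def commit_update(self, value):
        self.versions.append((self.latest_seq() + 1, value))

    def read_as_of(self, seq):
        return [v for s, v in self.versions if s <= seq][-1]


class RepeatableReadSession:
    """A repeatable read transaction whose SELECTs go through a shared cache."""

    def __init__(self, db, shared_cache):
        self.db = db
        self.cache = shared_cache
        self.snapshot = None
        self.own_write = None
        self.pending = {}   # cache entries waiting for commit

    def select(self, sql):
        if sql in self.cache:
            # Cache hit: the session's snapshot is never consulted.
            return self.cache[sql]
        if self.snapshot is None:
            self.snapshot = self.db.latest_seq()
        if self.own_write is not None:
            value = self.own_write   # a transaction sees its own update
        else:
            value = self.db.read_as_of(self.snapshot)
        self.pending[sql] = value
        return value

    def update(self, value):
        self.own_write = value
        self.pending.clear()

    def commit(self):
        if self.own_write is not None:
            self.db.commit_update(self.own_write)
            self.cache.clear()           # the write invalidates older entries
        self.cache.update(self.pending)  # entries become visible to everyone


SQL = "SELECT v FROM t1"
db = ToyDatabase("old")
shared_cache = {}

other = RepeatableReadSession(db, shared_cache)
first_read = other.select(SQL)    # "old": the other user's snapshot is taken here

you = RepeatableReadSession(db, shared_cache)
you.update("new")
you.select(SQL)                   # creates a pending cache entry holding "new"
you.commit()                      # the entry is now in the shared cache

second_read = other.select(SQL)               # served from cache: "new"
by_snapshot = db.read_as_of(other.snapshot)   # what the snapshot alone allows: "old"
print(first_read, second_read, by_snapshot)
```

The other user's transaction reads "old" and then "new" within the same repeatable read transaction, which is exactly the anomaly: a direct read under its snapshot would still return "old".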

This is somewhat similar to the phenomenon you may see while using COPY FREEZE. Suppose the transaction isolation level is repeatable read and you are copying data into a table using COPY FREEZE in a transaction. Other transactions that started before your transaction will see the copied data once your transaction gets committed. This is because "frozen" rows can be seen by anyone.

Conclusion

Pgpool-II's query cache is a useful feature for busy, read-intensive systems. However, it may cause certain read anomalies under the repeatable read and serializable transaction isolation levels. So please use the feature carefully with those transaction isolation levels.

2 comments:

intgr, April 19, 2021 at 7:26 PM
> Query results are fetched from memory, rather than database's (typically on desk [sic]) storage

That's a misleading claim.

If the accessed data is warm enough to stay in pgpool's cache, it's also very likely to be cached in RAM on the database server, either by Postgres's own shared buffers, or in the operating system's page cache. In most database workloads it's rare to hit the disk for reads.