Friday, July 19, 2024

Row pattern recognition feature for PostgreSQL

What is row pattern recognition feature?

Row pattern recognition (RPR) is a feature defined in the SQL standard. It allows to search for a sequence of rows by pattern.  Since I am working on this feature for PostgreSQL, I would like to give a brief introduction to RPR.
 
Consider a table holding date and daily stock price of a company. 
 company  |   tdate    | price
----------+------------+-------
 company1 | 2024-07-01 |   100
 company1 | 2024-07-02 |   200
 company1 | 2024-07-03 |   150
 company1 | 2024-07-04 |   140
 company1 | 2024-07-05 |   150
 company1 | 2024-07-06 |    90
 company1 | 2024-07-07 |   110
 company1 | 2024-07-08 |   130
 company1 | 2024-07-09 |   120
 company1 | 2024-07-10 |   130
(10 rows)
Suppose you want to find a sequence of rows in which the stock price rises once or more then falls once or more. For example, the stock price on July 1 is 100, then rises to 200 on July 2 and then falls to 150 and 140 on July 2 to July 3. RPR allows users to write this kind of queries in intuitive way.
 

RPR syntax

 To express the query in RPR, you first define row pattern  variables.
 
DEFINE
START AS TRUE,
UP AS price > PREV(price),
DOWN AS price < PREV(price)

Here DEFINE is a keyword to start the definition. START, UP and DOWN are row pattern variable names. The right hand side of AS is logical expression which the pattern variable needs to satisfy. The logical expression "TRUE" means always true. PREV() is a special function that only used in RPR (cannot use elsewhere, for example SELECT target or WHERE clause).  PREV takes a column name as an argument, and returns the previous row's column value. So if today's price is higher than yesterday's price, "price > PREV(price)" returns TRUE. DOWN has opposite meaning.

Once the row pattern variables are defined, you can define patterns to search for by using the row pattern variables.

PATTERN (START, UP+, DOWN+)

Here PATTERN is a keyword to start the definition. START matches any row and this is the starting point of the pattern matching. If the next row satisfies UP, then more rows are examined if they satisfy the logical expression UP because '+' is added to UP. This is a similar concept of regular expressions. If any row does not satisfy UP, then DOWN is tested against the row. If it satisfies, more rows are examined like UP. If no row matches DOWN, then the search finishes. 

The RPR syntax

The explanation above is just a part of RPR syntax. Now let's see the whole RPR syntax.
 
 WINDOW window_name AS (
[ PARTITION BY ... ]
[ ORDER BY... ]
[ MEASURES ... ]
ROWS BETWEEN CURRENT ROW AND ...
[ AFTER MATCH SKIP ... ]
[ INITIAL|SEEK ]
PATTERN (...)
[ SUBSET ... ]
DEFINE ...
)

As may notice, this is an expansion to WINDOW clause. Actually there are two types of RPR defined in the SQL standard: R010 and R020. R020 is the syntax above. Since I am working on implementing R020 in PostgreSQL,  I focus on R202 in this article. Also current implementation does not support MEASURE and SUBSET, I will not explain these in the rest of this article. If you are interested in R010 or MEASURE, SUBSET, please look into the SQL standard.
 

Let's run RPR

If I implement the example above in the R020 syntax, the whole query and the result will look like this:

SELECT company, tdate, price,
 count(*) OVER w
 FROM stock
 WINDOW w AS (
 PARTITION BY company
 ORDER BY tdate
 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
 AFTER MATCH SKIP PAST LAST ROW
 INITIAL
 PATTERN (START UP+ DOWN+)
 DEFINE
  START AS TRUE,
  UP AS price > PREV(price),
  DOWN AS price < PREV(price)
);
 
 company  |   tdate    | price | count
----------+------------+-------+-------
 company1 | 2024-07-01 |   100 |     4
 company1 | 2024-07-02 |   200 |     0
 company1 | 2024-07-03 |   150 |     0
 company1 | 2024-07-04 |   140 |     0

 company1 | 2024-07-05 |   150 |     0
 company1 | 2024-07-06 |    90 |     4
 company1 | 2024-07-07 |   110 |     0
 company1 | 2024-07-08 |   130 |     0
 company1 | 2024-07-09 |   120 |     0

 company1 | 2024-07-10 |   130 |     0
(10 rows)


As we already see the stock price on July 1 is 100, then rises to 200 on July 2 and then falls to 150 and 140 on July 2 to July 3. The count column shows that the query found 4 matching rows. The reason why the count column for July 2 to 4 are 0 is they are skipped and are not evaluated. This behavior comes from "AFTER MATCH SKIP PAST LAST ROW" clause: as 4 rows are found to be matched, we skip 4 rows and restart next pattern matching from July 5. Since stock price on July 6 is 90, the pattern matching fails. Thus the count column is 0.
 
And next pattern match starts at July 6. Now we find 4 matching rows.
 
Next pattern matching  starts at July 10. The pattern match needs at least 3 rows. However there's only 1 ow remains in the partition, patter matching fails.
 

Variation of pattern

 The pattern matching can be easily extended. For example you can add more constraint to the pattern like "not only the stock price rises but the range is more than 50". In this case you can change the DEFINE clause as follows:
 
 DEFINE
  START AS TRUE,
  UP AS price > PREV(price) + 50,
  DOWN AS price < PREV(price)
 
 With this change we get this:

 company  |   tdate    | price | count
----------+------------+-------+-------
 company1 | 2024-07-01 |   100 |     4
 company1 | 2024-07-02 |   200 |     0
 company1 | 2024-07-03 |   150 |     0
 company1 | 2024-07-04 |   140 |     0

 company1 | 2024-07-05 |   150 |     0
 company1 | 2024-07-06 |    90 |     0
 company1 | 2024-07-07 |   110 |     0
 company1 | 2024-07-08 |   130 |     0
 company1 | 2024-07-09 |   120 |     0
 company1 | 2024-07-10 |   130 |     0
(10 rows)

Please note that July 6 row does not match any more.

What's in the current implementation?

I have been working on this feature and publishing patches on pgsql-hackers mailing list.  The whole discussion can be found at [1]. The latest patch can be found near the end of the thread. At this point the supported features in the patch are explained in [2]. If you are interested more examples, probably the regression test result in the patch is what you want to check (src/test/regress/expected/rpr.out).

How to test the patch?

Since the patch is against PostgreSQL git repository master branch, you need to clone the git repository first.

git clone https://git.postgresql.org/git/postgresql.git
 
Save the patch somewhere. Suppose you save in /tmp/rpr. 

git apply /tmp/rpr/*.patch

Then build PostgreSQL as usual. Since the patch changes some system catalogue, you need to create a fresh database cluster by using initdb.

If you find something with the patch or have comments/suggestions, please post to [1].

Also you can watch a talk I gave at PGConf.dev 2024 [3] and the slide[4].

 
 
 



No comments:

Post a Comment

Row pattern recognition feature for PostgreSQL

What is row pattern recognition feature? Row pattern recognition (RPR) is a feature defined in the SQL standard. It allows to search for a s...