Re: Row pattern recognition

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: david(dot)g(dot)johnston(at)gmail(dot)com, vik(at)postgresfriends(dot)org, jacob(dot)champion(at)enterprisedb(dot)com, er(at)xs4all(dot)nl, peter(at)eisentraut(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Row pattern recognition
Date: 2024-12-30 13:37:18
Message-ID: 20241230.223718.1231120135312685954.ishii@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have added further optimization to the v25 patch.

While generating possible input strings that may satisfy the pattern
string, it is possible to omit to run regexp in some cases. Since
regexp matching is heavy operation, especially if it is applied to
long string, it is beneficial for RPR to reduce the number of regexp
runs.

If the tail pattern variable has '+' quantifier and previously the
input string was confirmed to be matched the pattern string, and the
same character as the tail pattern string is added, we don't need run
regexp match again because the new input string surely matches the
pattern string. Suppose a pattern string is "ab+" and the current
input string is "ab" (this satisfies "ab+"). If the new input string
is "abb", then "abb" surely matches "ab+" too and we don't need to run
regexp again.

In v26 patch, by using the technique above I get performance
improvement.

>> EXPLAIN (ANALYZE)
>> SELECT aid, bid, count(*) OVER w
>> FROM pgbench_accounts WHERE aid <= 10000
>> WINDOW w AS (
>> PARTITION BY bid
>> ORDER BY aid
>> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>> AFTER MATCH SKIP PAST LAST ROW
>> INITIAL
>> PATTERN (START UP+)
>> DEFINE
>> START AS TRUE,
>> UP AS aid > PREV(aid)
>> );

This SQL took 322.5913 ms (average in 3 runs) in v24. With v26 patch,
it takes 41.84 ms, which is over 7 times improvement. Also I run the
SQL in 100k row case. v23 took 26 seconds. With the v26 patch it takes
1195.603 ms, which is over 21 times improvement.

I think a pattern string ended up with '+' is one of common use cases
of RPR, and I believe the improvement is useful for many RPR
applications.

I also add following changes to v25.

- Fix do_pattern_match to use the top memory context to store compiled
re cache. Before it was in per query memory context. This causes a
trouble because do_pattern_match checks the cache existence using
a static variable.

- Refactor search_str_set, which is a workhorse of pattern matching,
into multiple functions to understand the logic easier.

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Attachment Content-Type Size
v26-0001-Row-pattern-recognition-patch-for-raw-parser.patch text/x-patch 19.9 KB
v26-0002-Row-pattern-recognition-patch-parse-analysis.patch text/x-patch 11.6 KB
v26-0003-Row-pattern-recognition-patch-rewriter.patch text/x-patch 4.1 KB
v26-0004-Row-pattern-recognition-patch-planner.patch text/x-patch 6.6 KB
v26-0005-Row-pattern-recognition-patch-executor.patch text/x-patch 60.3 KB
v26-0006-Row-pattern-recognition-patch-docs.patch text/x-patch 9.7 KB
v26-0007-Row-pattern-recognition-patch-tests.patch text/x-patch 50.6 KB
v26-0008-Row-pattern-recognition-patch-typedefs.list.patch text/x-patch 1.2 KB
v26-0009-Allow-to-print-raw-parse-tree.patch text/x-patch 785 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sami Imseih 2024-12-30 15:06:01 Re: Logging parallel worker draught
Previous Message Nazir Bilal Yavuz 2024-12-30 12:39:18 Re: add support for the old naming libs convention on windows (ssleay32.lib and libeay32.lib)