Re: Row pattern recognition

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: jacob(dot)champion(at)enterprisedb(dot)com
Cc: vik(at)postgresfriends(dot)org, pgsql-hackers(at)postgresql(dot)org, er(at)xs4all(dot)nl, peter(at)eisentraut(dot)org
Subject: Re: Row pattern recognition
Date: 2024-09-28 10:43:59
Message-ID: 20240928.194359.1327378331465099656.ishii@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> With some bigger partitions, I hit an `ERROR: wrong pos: 1024`. A
> test that reproduces it is attached.

Thanks for the report. Attached is a patch on top of v22 patches to
fix the bug. We keep info in an array
(WindowAggState.reduced_frame_map) to track the rpr pattern match
result status for each row in a frame. If pattern match succeeds, the
first row in the reduced frame has status RF_FRAME_HEAD and rest of
rows have RF_SKIPPED state. A row with pattern match failure state has
RF_UNMATCHED state. Any row which is never tested has state
RF_NOT_DETERMINED. At begining the map is initialized with 1024
entries with all RF_NOT_DETERMINED state. Eventually they are replaced
with other than RF_NOT_DETERMINED state. In the error case rpr engine
tries to find 1024 th row's state in the map and failed because the
row's state has not been tested yet. I think we should treat it as
RF_NOT_DETERMINED rather than an error. Attached patch does it.

> While playing with the feature, I've been trying to identify runs of
> matched rows by eye. But it's pretty difficult -- the best I can do is
> manually count rows using a `COUNT(*) OVER ...`. So I'd like to
> suggest that MEASURES be part of the eventual v1 feature, if there's
> no other way to determine whether a row was skipped by a previous
> match. (That was less obvious to me before the fix in v17.)

I think implementing MEASURES is challenging. Especially we need to
find how our parser accepts "colname OVER
window_definition". Currently PostgreSQL's parser only accepts "func()
OVER window_definition" Even it is technically possible, I think the
v1 patch size will become much larger than now due to this.

How about inventing new window function that returns row state instead?

- match found (yes/no)
- skipped due to AFTER MATCH SKIP PAST LAST ROW (no match)

For the rest of the mail I need more time to understand. I will reply
back after studying it. For now, I just want to thank you for the
valuable information!

> --
>
> I've been working on an implementation [1] of SQL/RPR's "parenthesized
> language" and preferment order. (These are defined in SQL/Foundation
> 2023, section 9.41.) The tool gives you a way to figure out, for a
> given pattern, what matches are supposed to be attempted and in what
> order:
>
> $ ./src/test/modules/rpr/rpr_prefer "a b? a"
> ( ( a ( b ) ) a )
> ( ( a ( ) ) a )
>
> Many simple patterns result in an infinite set of possible matches. So
> if you use an unbounded quantifiers, you have to also use --max-rows
> to limit the size of the hypothetical window frame:
>
> $ ./src/test/modules/rpr/rpr_prefer --max-rows 2 "^ PERMUTE(a*, b+)? $"
> ( ( ^ ( ( ( ( ( ( a ) ( b ) ) ) - ) ) ) ) $ )
> ( ( ^ ( ( ( ( ( ( ) ( b b ) ) ) - ) ) ) ) $ )
> ( ( ^ ( ( ( ( ( ( ) ( b ) ) ) - ) ) ) ) $ )
> ( ( ^ ( ( ( - ( ( ( b b ) ( ) ) ) ) ) ) ) $ )
> ( ( ^ ( ( ( - ( ( ( b ) ( a ) ) ) ) ) ) ) $ )
> ( ( ^ ( ( ( - ( ( ( b ) ( ) ) ) ) ) ) ) $ )
> ( ( ^ ( ) ) $ )
>
> I've found this useful to check my personal understanding of the spec
> and the match behavior, but it could also potentially be used to
> generate test cases, or to help users debug their own patterns. For
> example, a pattern that has a bunch of duplicate sequences in its PL
> is probably not very well optimized:
>
> $ ./src/test/modules/rpr/rpr_prefer --max-rows 4 "a+ a+"
> ( ( a a a ) ( a ) )
> ( ( a a ) ( a a ) )
> ( ( a a ) ( a ) )
> ( ( a ) ( a a a ) )
> ( ( a ) ( a a ) )
> ( ( a ) ( a ) )
>
> And patterns with catastrophic backtracking behavior tend to show a
> "sawtooth" pattern in the output, with a huge number of potential
> matches being generated relative to the number of rows in the frame.
>
> My implementation is really messy -- it leaks memory like a sieve, and
> I cannibalized the parser from ECPG, which just ended up as an
> exercise in teaching myself flex/bison. But if there's interest in
> having this kind of tool in the tree, I can work on making it
> reviewable. Either way, I should be able to use it to double-check
> more complicated test cases.
>
> A while back [2], you were wondering whether our Bison implementation
> would be able to parse the PATTERN grammar directly. I think this tool
> proves that the answer is "yes", but PERMUTE in particular causes a
> shift/reduce conflict. To fix it, I applied the same precedence
> workaround that we use for CUBE and ROLLUP.
>
> Thanks again!
> --Jacob
>
> [1] https://github.com/jchampio/postgres/tree/dev/rpr
> [2] https://www.postgresql.org/message-id/20230721.151648.412762379013769790.t-ishii%40sranhm.sra.co.jp

Attachment Content-Type Size
unknown_filename text/plain 741 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message PG Bug reporting form 2024-09-28 11:00:01 BUG #18641: Logical decoding of two-phase commit fails with TOASTed default values
Previous Message jian he 2024-09-28 01:48:00 Re: Add new COPY option REJECT_LIMIT