Re: Row pattern recognition

From: Vik Fearing <vik(at)postgresfriends(dot)org>
To: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Row pattern recognition
Date: 2023-06-26 22:38:20
Message-ID: 90bff6fa-01b7-cf60-f4f4-3412514a7cd2@postgresfriends.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/26/23 03:05, Tatsuo Ishii wrote:
>> I don't understand this. RPR in a window specification limits the
>> window to the matched rows, so this looks like your rpr() function is
>> just the regular first_value() window function that we already have?
>
> No, rpr() is different from first_value(). rpr() returns the argument
> value at the first row in a frame only when matched rows found. On the
> other hand first_value() returns the argument value at the first row
> in a frame unconditionally.
>
> company | tdate | price | rpr | first_value
> ----------+------------+-------+------+-------------
> company1 | 2023-07-01 | 100 | | 100
> company1 | 2023-07-02 | 200 | 200 | 200
> company1 | 2023-07-03 | 150 | 150 | 150
> company1 | 2023-07-04 | 140 | | 140
> company1 | 2023-07-05 | 150 | 150 | 150
> company1 | 2023-07-06 | 90 | | 90
> company1 | 2023-07-07 | 110 | | 110
> company1 | 2023-07-08 | 130 | | 130
> company1 | 2023-07-09 | 120 | | 120
> company1 | 2023-07-10 | 130 | | 130
>
> For example, a frame starting with (tdate = 2023-07-02, price = 200)
> consists of rows (price = 200, 150, 140, 150) satisfying the pattern,
> thus rpr returns 200. Since in this example frame option "ROWS BETWEEN
> CURRENT ROW AND UNBOUNDED FOLLOWING" is specified, next frame starts
> with (tdate = 2023-07-03, price = 150). This frame satisfies the
> pattern too (price = 150, 140, 150), and rpr retus 150... and so on.

Okay, I see the problem now, and why you need the rpr() function.

You are doing this as something that happens over a window frame, but it
is actually something that *reduces* the window frame. The pattern
matching needs to be done when the frame is calculated and not when any
particular function is applied over it.

This query (with all the defaults made explicit):

SELECT s.company, s.tdate, s.price,
FIRST_VALUE(s.tdate) OVER w,
LAST_VALUE(s.tdate) OVER w,
lowest OVER w
FROM stock AS s
WINDOW w AS (
PARTITION BY s.company
ORDER BY s.tdate
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
EXCLUDE NO OTHERS
MEASURES
LAST(DOWN) AS lowest
AFTER MATCH SKIP PAST LAST ROW
INITIAL PATTERN (START DOWN+ UP+)
DEFINE
START AS TRUE,
UP AS price > PREV(price),
DOWN AS price < PREV(price)
);

Should produce this result:

company | tdate | price | first_value | last_value | lowest
----------+------------+-------+-------------+------------+--------
company1 | 07-01-2023 | 100 | | |
company1 | 07-02-2023 | 200 | 07-02-2023 | 07-05-2023 | 140
company1 | 07-03-2023 | 150 | | |
company1 | 07-04-2023 | 140 | | |
company1 | 07-05-2023 | 150 | | |
company1 | 07-06-2023 | 90 | | |
company1 | 07-07-2023 | 110 | | |
company1 | 07-08-2023 | 130 | 07-05-2023 | 07-05-2023 | 120
company1 | 07-09-2023 | 120 | | |
company1 | 07-10-2023 | 130 | | |
(10 rows)

Or if we switch to AFTER MATCH SKIP TO NEXT ROW, then we get:

company | tdate | price | first_value | last_value | lowest
----------+------------+-------+-------------+------------+--------
company1 | 07-01-2023 | 100 | | |
company1 | 07-02-2023 | 200 | 07-02-2023 | 07-05-2023 | 140
company1 | 07-03-2023 | 150 | 07-03-2023 | 07-05-2023 | 140
company1 | 07-04-2023 | 140 | | |
company1 | 07-05-2023 | 150 | 07-05-2023 | 07-08-2023 | 90
company1 | 07-06-2023 | 90 | | |
company1 | 07-07-2023 | 110 | | |
company1 | 07-08-2023 | 130 | 07-08-2023 | 07-10-2023 | 120
company1 | 07-09-2023 | 120 | | |
company1 | 07-10-2023 | 130 | | |
(10 rows)

And then if we change INITIAL to SEEK:

company | tdate | price | first_value | last_value | lowest
----------+------------+-------+-------------+------------+--------
company1 | 07-01-2023 | 100 | 07-02-2023 | 07-05-2023 | 140
company1 | 07-02-2023 | 200 | 07-02-2023 | 07-05-2023 | 140
company1 | 07-03-2023 | 150 | 07-03-2023 | 07-05-2023 | 140
company1 | 07-04-2023 | 140 | 07-05-2023 | 07-08-2023 | 90
company1 | 07-05-2023 | 150 | 07-05-2023 | 07-08-2023 | 90
company1 | 07-06-2023 | 90 | 07-08-2023 | 07-10-2023 | 120
company1 | 07-07-2023 | 110 | 07-08-2023 | 07-10-2023 | 120
company1 | 07-08-2023 | 130 | 07-08-2023 | 07-10-2023 | 120
company1 | 07-09-2023 | 120 | | |
company1 | 07-10-2023 | 130 | | |
(10 rows)

Since the pattern recognition is part of the frame, the window
aggregates should Just Work.

>>> o SUBSET is not supported
>>
>> Is this because you haven't done it yet, or because you ran into
>> problems trying to do it?
>
> Because it seems SUBSET is not useful without MEASURES support. Thus
> my plan is, firstly implement MEASURES, then SUBSET. What do you
> think?

SUBSET elements can be used in DEFINE clauses, but I do not think this
is important compared to other features.

>>> Comments and suggestions are welcome.
>>
>> I have not looked at the patch yet, but is the reason for doing R020
>> before R010 because you haven't done the MEASURES clause yet?
>
> One of the reasons is, implementing MATCH_RECOGNIZE (R010) looked
> harder for me because modifying main SELECT clause could be a hard
> work. Another reason is, I had no idea how to implement PREV/NEXT in
> other than in WINDOW clause. Other people might feel differently
> though.

I think we could do this with a single tuplesort if we use backtracking
(which might be really slow for some patterns). I have not looked into
it in any detail.

We would need to be able to remove tuples from the end (even if only
logically), and be able to update tuples inside the store. Both of
those needs come from backtracking and possibly changing the classifier.

Without backtracking, I don't see how we could do it without have a
separate tuplestore for every current possible match.

>> In any case, I will be watching this with a close eye, and I am eager
>> to help in any way I can.
>
> Thank you! I am looking forward to comments on my patch. Also any
> idea how to implement MEASURES clause is welcome.

I looked at your v2 patches a little bit and the only comment that I
currently have on the code is you spelled PERMUTE as PREMUTE.
Everything else is hopefully explained above.
--
Vik Fearing

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vik Fearing 2023-06-26 23:47:44 Re: Fixing tab-complete for dollar-names
Previous Message Kirk Wolak 2023-06-26 22:26:42 Re: Do we want a hashset type?