Re: Row pattern recognition

From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: jchampion(at)timescale(dot)com
Cc: er(at)xs4all(dot)nl, vik(at)postgresfriends(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Row pattern recognition
Date: 2023-09-09 11:21:21
Message-ID: 20230909.202121.983745512339833395.t-ishii@sranhm.sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

>> But:
>>
>> UP AS price > PREV(price)
>>
>> also depends on previous row, no?
>
> PREV(CLASSIFIER()) depends not on the value of the previous row but the
> state of the match so far. To take an example from the patch:
>
>> * Example:
>> * str_set[0] = "AB";
>> * str_set[1] = "AC";
>> * In this case at row 0 A and B are true, and A and C are true in row 1.
>
> With these str_sets and my example DEFINE, row[1] is only classifiable
> as 'A' if row[0] is *not* classified as 'A' at this point in the match.
> "AA" is not a valid candidate string, even if it matches the PATTERN.

Ok, Let me clarify my understanding. Suppose we have:

PATTER (A B)
DEFINE A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A',
B AS price > 100

and the target table has price column values:

row[0]: 110
row[1]: 110
row[2]: 110
row[3]: 110

Then we will get for str_set:
r0: B
r1: AB

Because r0 only has classifier B, r1 can have A and B. Problem is,
r2. If we choose A at r1, then r2 = B. But if we choose B at t1, then
r2 = AB. I guess this is the issue you pointed out.

> So if we don't reevaluate the pattern variable condition for the row, we
> at least have to prune the combinations that search_str_set() visits, so
> that we don't generate a logically impossible combination. That seems
> like it could be pretty fragile, and it may be difficult for us to prove
> compliance.

Yeah, probably we have delay evaluation of such pattern variables like
A, then reevaluate A after the first scan.

What about leaving this (reevaluation) for now? Because:

1) we don't have CLASSIFIER
2) we don't allow to give CLASSIFIER to PREV as its arggument

so I think we don't need to worry about this for now.

>> Can you explain a little bit more? I think 'aaa' matches a regular
>> expression 'a+(b|a)+' and should be no problem before "aab" is
>> considered.
>
> Assuming I've understood the rules correctly, we're not allowed to
> classify the last row as 'A' if it also matches 'B'. Lexicographic
> ordering takes precedence, so we have to try "aab" first. Otherwise our
> query could return different results compared to another implementation.
>
>>> Similarly, the assumption that we want to match the longest string only
>>> works because we don't allow alternation yet.
>>
>> Can you please clarify more on this?
>
> Sure: for the pattern
>
> PATTERN ( (A|B)+ )
>
> we have to consider the candidate "a" over the candidate "ba", even
> though the latter is longer. Like the prior example, lexicographic
> ordering is considered more important than the greedy quantifier.
> Quoting ISO/IEC 9075-2:2016:
>
> More precisely, with both reluctant and greedy quantifiers, the set
> of matches is ordered lexicographically, but when one match is an
> initial substring of another match, reluctant quantifiers prefer the
> shorter match (the substring), whereas greedy quantifiers prefer the
> longer match (the “superstring”).
>
> Here, "ba" doesn't have "a" as a prefix, so "ba" doesn't get priority.
> ISO/IEC 19075-5:2021 has a big section on this (7.2) with worked examples.
>
> (The "lexicographic order matters more than greediness" concept was the
> most mind-bending part for me so far, probably because I haven't figured
> out how to translate the concept into POSIX EREs. It wouldn't make sense
> to say "the letter 't' can match 'a', 'B', or '3' in this regex", but
> that's what RPR is doing.)

Thanks for the explanation. Surprising concet of the standard:-) Is
it different from SIMILAR TO REs too?

What if we don't follow the standard, instead we follow POSIX EREs? I
think this is better for users unless RPR's REs has significant merit
for users.

>> Do you mean you want to provide a better patch for the pattern
>> matching part? That will be helpfull.
>
> No guarantees that I'll find a better patch :D But yes, I will give it a
> try.

Ok.

>> Because I am currently working
>> on the aggregation part and have no time to do it. However, the
>> aggregation work affects the v5 patch: it needs a refactoring. So can
>> you wait until I release v6 patch? I hope it will be released in two
>> weeks or so.
>
> Absolutely!

Thanks.

> Heh, I think it would be pretty foolish for me to code an NFA, from
> scratch, and then try to convince the community to maintain it.
>
> But:
> - I think we have to implement a parallel parser regardless (RPR PATTERN
> syntax looks incompatible with POSIX)

I am not sure if we need to worry about this because of the reason I
mentioned above.

> - I suspect we need more control over the backtracking than the current
> pg_reg* API is going to give us, or else I'm worried performance is
> going to fall off a cliff with usefully-large partitions

Agreed.

> - there's a lot of stuff in POSIX EREs that we don't need, and of the
> features we do need, the + quantifier is probably one of the easiest
> - it seems easier to prove the correctness of a slow, naive,
> row-at-a-time engine, because we can compare it directly to the spec
>
> So what I'm thinking is: if I start by open-coding the + quantifier, and
> slowly add more pieces in, then it might be easier to see the parts of
> src/backend/regex that I've duplicated. We can try to expose those parts
> directly from the internal API to replace my bad implementation. And if
> there are parts that aren't duplicated, then it'll be easier to explain
> why we need something different from the current engine.
>
> Does that seem like a workable approach? (Worst-case, my code is just
> horrible, and we throw it in the trash.)

Yes, it seems workable. I think for the first cut of RPR needs at
least the +quantifier with reasonable performance. The current naive
implementation seems to have issue because of exhaustive search.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2023-09-09 12:13:38 Re: Suspicious redundant assignment in COPY FROM
Previous Message Alexander Lakhin 2023-09-09 09:00:00 Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)