From: | Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> |
---|---|
To: | jchampion(at)timescale(dot)com |
Cc: | er(at)xs4all(dot)nl, vik(at)postgresfriends(dot)org, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Row pattern recognition |
Date: | 2023-09-09 11:21:21 |
Message-ID: | 20230909.202121.983745512339833395.t-ishii@sranhm.sra.co.jp |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
>> But:
>>
>> UP AS price > PREV(price)
>>
>> also depends on previous row, no?
>
> PREV(CLASSIFIER()) depends not on the value of the previous row but the
> state of the match so far. To take an example from the patch:
>
>> * Example:
>> * str_set[0] = "AB";
>> * str_set[1] = "AC";
>> * In this case at row 0 A and B are true, and A and C are true in row 1.
>
> With these str_sets and my example DEFINE, row[1] is only classifiable
> as 'A' if row[0] is *not* classified as 'A' at this point in the match.
> "AA" is not a valid candidate string, even if it matches the PATTERN.
Ok, Let me clarify my understanding. Suppose we have:
PATTER (A B)
DEFINE A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A',
B AS price > 100
and the target table has price column values:
row[0]: 110
row[1]: 110
row[2]: 110
row[3]: 110
Then we will get for str_set:
r0: B
r1: AB
Because r0 only has classifier B, r1 can have A and B. Problem is,
r2. If we choose A at r1, then r2 = B. But if we choose B at t1, then
r2 = AB. I guess this is the issue you pointed out.
> So if we don't reevaluate the pattern variable condition for the row, we
> at least have to prune the combinations that search_str_set() visits, so
> that we don't generate a logically impossible combination. That seems
> like it could be pretty fragile, and it may be difficult for us to prove
> compliance.
Yeah, probably we have delay evaluation of such pattern variables like
A, then reevaluate A after the first scan.
What about leaving this (reevaluation) for now? Because:
1) we don't have CLASSIFIER
2) we don't allow to give CLASSIFIER to PREV as its arggument
so I think we don't need to worry about this for now.
>> Can you explain a little bit more? I think 'aaa' matches a regular
>> expression 'a+(b|a)+' and should be no problem before "aab" is
>> considered.
>
> Assuming I've understood the rules correctly, we're not allowed to
> classify the last row as 'A' if it also matches 'B'. Lexicographic
> ordering takes precedence, so we have to try "aab" first. Otherwise our
> query could return different results compared to another implementation.
>
>>> Similarly, the assumption that we want to match the longest string only
>>> works because we don't allow alternation yet.
>>
>> Can you please clarify more on this?
>
> Sure: for the pattern
>
> PATTERN ( (A|B)+ )
>
> we have to consider the candidate "a" over the candidate "ba", even
> though the latter is longer. Like the prior example, lexicographic
> ordering is considered more important than the greedy quantifier.
> Quoting ISO/IEC 9075-2:2016:
>
> More precisely, with both reluctant and greedy quantifiers, the set
> of matches is ordered lexicographically, but when one match is an
> initial substring of another match, reluctant quantifiers prefer the
> shorter match (the substring), whereas greedy quantifiers prefer the
> longer match (the “superstring”).
>
> Here, "ba" doesn't have "a" as a prefix, so "ba" doesn't get priority.
> ISO/IEC 19075-5:2021 has a big section on this (7.2) with worked examples.
>
> (The "lexicographic order matters more than greediness" concept was the
> most mind-bending part for me so far, probably because I haven't figured
> out how to translate the concept into POSIX EREs. It wouldn't make sense
> to say "the letter 't' can match 'a', 'B', or '3' in this regex", but
> that's what RPR is doing.)
Thanks for the explanation. Surprising concet of the standard:-) Is
it different from SIMILAR TO REs too?
What if we don't follow the standard, instead we follow POSIX EREs? I
think this is better for users unless RPR's REs has significant merit
for users.
>> Do you mean you want to provide a better patch for the pattern
>> matching part? That will be helpfull.
>
> No guarantees that I'll find a better patch :D But yes, I will give it a
> try.
Ok.
>> Because I am currently working
>> on the aggregation part and have no time to do it. However, the
>> aggregation work affects the v5 patch: it needs a refactoring. So can
>> you wait until I release v6 patch? I hope it will be released in two
>> weeks or so.
>
> Absolutely!
Thanks.
> Heh, I think it would be pretty foolish for me to code an NFA, from
> scratch, and then try to convince the community to maintain it.
>
> But:
> - I think we have to implement a parallel parser regardless (RPR PATTERN
> syntax looks incompatible with POSIX)
I am not sure if we need to worry about this because of the reason I
mentioned above.
> - I suspect we need more control over the backtracking than the current
> pg_reg* API is going to give us, or else I'm worried performance is
> going to fall off a cliff with usefully-large partitions
Agreed.
> - there's a lot of stuff in POSIX EREs that we don't need, and of the
> features we do need, the + quantifier is probably one of the easiest
> - it seems easier to prove the correctness of a slow, naive,
> row-at-a-time engine, because we can compare it directly to the spec
>
> So what I'm thinking is: if I start by open-coding the + quantifier, and
> slowly add more pieces in, then it might be easier to see the parts of
> src/backend/regex that I've duplicated. We can try to expose those parts
> directly from the internal API to replace my bad implementation. And if
> there are parts that aren't duplicated, then it'll be easier to explain
> why we need something different from the current engine.
>
> Does that seem like a workable approach? (Worst-case, my code is just
> horrible, and we throw it in the trash.)
Yes, it seems workable. I think for the first cut of RPR needs at
least the +quantifier with reasonable performance. The current naive
implementation seems to have issue because of exhaustive search.
Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2023-09-09 12:13:38 | Re: Suspicious redundant assignment in COPY FROM |
Previous Message | Alexander Lakhin | 2023-09-09 09:00:00 | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) |