From: | Jacob Champion <jchampion(at)timescale(dot)com> |
---|---|
To: | Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Vik Fearing <vik(at)postgresfriends(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Row pattern recognition |
Date: | 2023-09-08 00:00:07 |
Message-ID: | fcf6dcfa-5a61-00ea-4311-d3d003b790ed@timescale.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello!
> (1) I completely changed the pattern matching engine so that it
> performs backtracking. Now the engine evaluates all pattern elements
> defined in PATTER against each row, saving matched pattern variables
> in a string per row. For example if the pattern element A and B
> evaluated to true, a string "AB" is created for current row.
If I understand correctly, this strategy assumes that one row's
membership in a pattern variable is independent of the other rows'
membership. But I don't think that's necessarily true:
DEFINE
A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A',
...
> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
> in nodeWindowAggs.c for more details. For now I use a naive depth
> first search and probably there is a lot of rooms for optimization
> (for example rewriting it without using
> recursion).
The depth-first match is doing a lot of subtle work here. For example, with
PATTERN ( A+ B+ )
DEFINE A AS TRUE,
B AS TRUE
(i.e. all rows match both variables), and three rows in the partition,
our candidates will be tried in the order
aaa
aab
aba
abb
...
bbb
The only possible matches against our regex `^a+b+` are "aab" and "abb",
and that order matches the preferment order, so it's fine. But it's easy
to come up with a pattern where that's the wrong order, like
PATTERN ( A+ (B|A)+ )
Now "aaa" will be considered before "aab", which isn't correct.
Similarly, the assumption that we want to match the longest string only
works because we don't allow alternation yet.
> Suggestions/patches are welcome.
Cool, I will give this piece some more thought. Do you mind if I try to
add some more complicated pattern quantifiers to stress the
architecture, or would you prefer to tackle that later? Just alternation
by itself will open up a world of corner cases.
> With the new engine, cases above do not fail anymore. See new
> regression test cases. Thanks for providing valuable test cases!
You're very welcome!
On 8/9/23 01:41, Tatsuo Ishii wrote:
> - I found a bug with pattern matching code. It creates a string for
> subsequent regular expression matching. It uses the initial letter
> of each define variable name. For example, if the varname is "foo",
> then "f" is used. Obviously this makes trouble if we have two or
> more variables starting with same "f" (e.g. "food"). To fix this, I
> assign [a-z] to each variable instead of its initial letter. However
> this way limits us not to have more than 26 variables. I hope 26 is
> enough for most use cases.
There are still plenty of alphanumerics left that could be assigned...
But I'm wondering if we might want to just implement the NFA directly?
The current implementation's Cartesian explosion can probably be pruned
aggressively, but replaying the entire regex match once for every
backtracked step will still duplicate a lot of work.
--
I've attached another test case; it looks like last_value() is depending
on some sort of side effect from either first_value() or nth_value(). I
know the window frame itself is still under construction, so apologies
if this is an expected failure.
Thanks!
--Jacob
Attachment | Content-Type | Size |
---|---|---|
test-last_value.diff.txt | text/plain | 2.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2023-09-08 00:07:13 | Re: Impact of checkpointer during pg_upgrade |
Previous Message | Michael Paquier | 2023-09-07 23:48:12 | Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~? |