From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Oliver Ford <ojford(at)gmail(dot)com> |
Cc: | "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Add RANGE with values and exclusions clauses to the Window Functions |
Date: | 2018-02-04 18:10:09 |
Message-ID: | 17947.1517767809@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Oliver Ford <ojford(at)gmail(dot)com> writes:
> [ 0001-window-frame-v13.patch ]
I've been hacking on this all week (with breaks for release notes) and
have gotten it into a state that I think is close to committable.
There was quite a lot I didn't like about the patch initially, notably
that the interaction with operator classes/families was done all wrong.
The idea is to add one support function per opclass, not jam them all
into one opclass that breaks every rule for B-tree opclasses. For one
reason, with this approach there's no chance of dealing with non-default
sort orders imposed by non-default opclasses. (As a concrete example,
suppose that we have two btree opclasses for complex numbers, one that
sorts by real part and one that sorts by imaginary part. You can write
a well-defined in_range function for each of these, but one of them has
to increment the real part and the other the imaginary part.) I whacked
that around and also wrote the missing documentation for the API spec
for in_range functions. The path of least resistance was to dump it
into the nbtree/README text file, which I'm not that satisfied with;
probably it should go in the main SGML docs, but I did not find a good
place to put it.
I also really didn't like the implementation you'd chosen in
nodeWindowAgg.c to scan the entire partition and build an array of peer
group lengths. That risks running OOM with a large partition. Even if
the array doesn't lead to OOM, the tuplestore will spill to disk with
nasty performance consequences. We should try to ensure that the
tuplestore needn't get larger than the frame, so that well-written queries
with narrow frames can execute without spilling to disk. So I rewrote
that using an idea that had been speculated about in the original
comments, but nobody had gotten to yet: add some more read pointers to
track the frame boundaries, and advance them as needed. I'm not really
sure if this ends up as more or few row comparisons than the other way,
but in any case it uses a fixed amount of memory, which is good.
Also, the last patch's reimplementation of WinGetFuncArgInFrame isn't
right: AFAICS, it results in any "relpos" that would point to a row
in the exclusion range returning the row just after/before that range,
which is already wrong if the exclusion range is more than one row,
plus it doesn't renumber the rows beyond the exclusion. The behavior
we want is that the frame rows surviving after exclusion should appear
consecutively numbered. (This could be exposed with some tests using
nth_value.) I think the attached rewrite gets this right. Also, punting
entirely on the set-mark problem for SEEK_TAIL cases doesn't satisfy me,
for the same reason as above that we don't want the tuplestore to bloat.
What I did below is to set the mark at the frame start, which at least
gives an opportunity for efficient queries.
I hacked around on various other things too, for instance the behavior
for null values in RANGE mode didn't seem to be per spec.
I'm reasonably happy with all the code now, though surely it could use
another look by someone else. I've not yet reviewed the docs (other than
the implementor-oriented details I added), nor have I really looked at the
test cases. I do have a couple suggestions on the test cases: for one,
rather than duplicating the same window definition N times in each query,
use one WINDOW clause and reference it with "OVER windowname". Also,
adding a bunch of columns of different types to a single table seems like
a messy and not easily extensible way of testing different data types.
I'd suggest leaving the existing table alone and adding a new test table
per additional data type you want to test, so that there's an easy
template for testing future additions of more in_range support.
BTW, something I've not done here but am strongly tempted to do is
run around and change all the uses of "RANGE value PRECEDING/FOLLOWING"
terminology to, say, "RANGE offset PRECEDING/FOLLOWING". "value" is
just way too generic a term for this situation, making documentation
confusing, plus you end up contorting sentences to avoid constructions
like "value of the value". I'm not wedded to "offset" if somebody's got a
better word, but let's try to pick something more specific than "value".
(In the ROWS and GROUPS cases, maybe write "count"? Not entirely sure
what to do for text that's trying to address all three cases, though.)
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
0001-window-frame-v14.patch | text/x-diff | 255.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2018-02-04 18:11:45 | Re: pgsql: Support parallel btree index builds. |
Previous Message | Peter Geoghegan | 2018-02-04 17:58:05 | Re: pgsql: Support parallel btree index builds. |