From: | Peter Moser <pitiz29a(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] [PROPOSAL] Temporal query processing with range types |
Date: | 2018-01-26 06:55:11 |
Message-ID: | CAHO0eLZ--LPCqaMpH1f6eKQZWNoU-BjYQ1p-Kj7VHJ=01vpsCQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, 2017-11-30 at 12:26 -0500, Robert Haas wrote:
> I wonder if we could instead think about R NORMALIZE S ON R.x = S.y
> WITH (R.time, S.time) as an ordinary join on R.x = S.y with some
> extra processing afterwards. After finding all the join partners for
> a row in R, extract all the lower and upper bounds from the S.time
> fields of the join partners and use that to emit as many rows from R
> as may be necessary.
We have now a new approach to plan and execute NORMALIZE as a special
join node of type NORMALIZE, an append-plan on the inner join path,
and a merge-join executor. For the latter, we would need to
extend nodeMergejoin.c with an point-in-range-containment join.
That is, we create a new join path within sort_inner_and_outer
(joinpath.c). First two projection nodes to extract the start- and
end-timepoints of the range type used as interval, and above an
append-plan to merge both subplans. In detail, outer_path is just sort,
whereas inner_path is append of (B, ts) projection with (B, te)
projection.
Hereby, B is a set of non-temporal attributes used in join equality
predicates, and [ts,te) forms the valid-time interval. Non-equality
predicates must be handled separately as a filter step.
Do you think, it is a good idea to extend the sort_inner_and_outer()
with a new branch, where jointype == NORMALIZE and add the projection
and append sub-paths there?
> The main problem that I see with that approach is that it
> seems desirable to separate the extra-processing-afterwards step into
> a separate executor node, and there's no way for a separate type of
> plan node to know when the join advances the outer side of the join.
> That's the purpose that row_id() is serving as you have it; we'd have
> to invent some other kind of signalling mechanism, which might not be
> entirely trivial. :-(
One advantage of the new approach is that row_id() is no longer needed.
The executor knows that it is inside a new "group" after every NEXTOUTER,
then the whole loop of the inner relation has the same row_id. The whole
mechanism could be handled through the MergeJoinState struct.
> If we could do it, though, it might be quite a bit more efficient,
> because it would avoid scanning S twice and performing a UNION on the
> results of those scans. Also, we wouldn't necessarily need to sort
> the whole set of rows from R, which I suspect is unavoidable in your
> implementation. We'd just need to sort the individual groups of rows
> from S, and my guess is that in many practical cases those groups are
> fairly small.
We would need a special SEQSCAN node/executor, that does the projection
and append steps in a single read. Do you have any suggestions how to
implement this?
> I wonder what identities hold for NORMALIZE. It does not seem to be
> commutative, but it looks like it might be associative... i.e. (R
> NORMALIZE S) NORMALIZE T produces the same output as R NORMALIZE (S
> NORMALIZE T), perhaps?
It is not commutative and also not associative.
Assume relations that contain one tuple each as follows:
R={[1, 100)}, S={[50, 100)}, and T={[10, 20)}
(R NORMALIZE S) NORMALIZE T gives {[1, 10), [10, 20), [20,50), [50, 100)}
while
R NORMALIZE (S NORMALIZE T) gives {[1, 50), [50, 100)}
Best regards,
Anton, Johann, Michael, Peter
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Moser | 2018-01-26 06:56:23 | Re: [HACKERS] [PROPOSAL] Temporal query processing with range types |
Previous Message | Peter Geoghegan | 2018-01-26 06:30:50 | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |