| From: | David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> | 
|---|---|
| To: | Beena Emerson <memissemerson(at)gmail(dot)com> | 
| Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, amul sul <sulamul(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> | 
| Subject: | Re: [HACKERS] Runtime Partition Pruning | 
| Date: | 2017-12-16 06:05:57 | 
| Message-ID: | CAKJS1f-K=3d5MDASNYFJpUpc20xcBnAwNC1-AOeunhn0OtkWbQ@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On 13 December 2017 at 00:33, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> PFA the updated patch, this can be applied over the v13 patches [1]
> over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3
Hi Beena,
Thanks for posting an updated patch.
I've been looking over this and I think that the use of the
PartitionDispatch in set_append_subplan_indexes is not correct. What
we need here is the index of the Append's subnode and that's not what
RelationGetPartitionDispatchInfo() gives you. Remember that some
partitions could have been pruned away already during planning.
This quick example shows that the partition selection is not correct.
create table p (a int, b int) partition by range (a);
create table p_a_neg partition of p for values from (minvalue) to (0)
partition by range (b);
create table p_a_pos partition of p for values from (0) to (maxvalue)
partition by range (b);
create table p_a_neg_b_neg partition of p_a_neg for values from
(minvalue) to (0);
create table p_a_neg_b_pos partition of p_a_neg for values from (0) to
(maxvalue);
create table p_a_pos_b_neg partition of p_a_pos for values from
(minvalue) to (0);
create table p_a_pos_b_pos partition of p_a_pos for values from (0) to
(maxvalue);
prepare q1 (int, int) as select * from p where a = $1 and b = $1;
explain analyze execute q1 (-1,-1); -- this works.
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..175.60 rows=4 width=8) (actual time=1.099..1.099
rows=0 loops=1)
   Runtime Partition Pruning: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_neg_b_neg  (cost=0.00..43.90 rows=1 width=8)
(actual time=0.023..0.023 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_neg_b_pos  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_pos_b_neg  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_pos_b_pos  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
(12 rows)
explain analyze execute q1 (-1,1); -- should scan p_a_neg_b_pos, but does not.
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..175.60 rows=4 width=8) (actual
time=758996.359..758996.359 rows=0 loops=1)
   Runtime Partition Pruning: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_neg_b_neg  (cost=0.00..43.90 rows=1 width=8)
(actual time=0.056..0.056 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_neg_b_pos  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_pos_b_neg  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_pos_b_pos  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
(12 rows)
So, I started to look at what the best way to put this right might be.
I see that since Parallel Append was committed that the subnodes are
now sorted in cost order inside create_append_path(), so likely we'll
want to delay figuring out the subpath list indexes until after that's
done since sorting would scramble our index arrays. We could simply
look at the subpaths at the end of create_append_path() and create
some sort of new matrix type that can accept the output of Amit's
get_partitions_from_clauses() and translate that Bitmapset into the
subpath indexes (another Bitmapset). This will also need to work for
sub-partitions too, so this matrix must be some sort of tree that we
can descend into when we see that get_partitions_from_clauses returned
a bit for a sub-partition instead of a leaf-partition.
I bashed this idea around a bit and I came up with the attached. It's
very far from complete and in a very WIP state. I've not really done
anything to make the correct clause list available in nodeAppend.c
yet, but I think the code that's there is worthy of a look. I've not
done that much work on the new choose_next_subplan* functions in
nodeAppend.c. I just modified choose_next_subplan_locally to show how
this set of functions need to take into account the subnode bitmap set
of valid partitions to scan. Perhaps some special case is needed to
have these functions ignore the Bitmapset when runtime pruning is
disabled (perhaps a completely new set of the functions is needed to
support the selection of the next non-pruned partition). Although,
probably that can be debated a bit later as it's a fairly minor detail
for now.
My patch also lacks any means to extract the Params during
match_clauses_to_partkey(), or at least most of the cases. I've just
added 1 case there. I did this because I thought it was better to
extract the ParamIds rather than a bool to say we've matched params.
This way we can only reevaluate which subplans to look at on rescan of
an Append if and only if the params we actually care about have
changed. I've not given this part a huge amount of thought yet.
I'm a little unsure where to go from here. Obviously, this is quite a
major rewrite of your patch. The parts that I've got missing likely
can use quite a bit of the stuff you've already written, but that
needs some review. I wanted to post this now as I know you're busy
working on this to rebase it on parallel Append and to also address
Robert's concerns, which all seem valid to me.
What's the best way for us to coordinate our efforts on this? Maybe
you could look at my patch and sanity check it to ensure I'm not
taking anything in the wrong direction?
I also think that MergeAppend needs similar work, but I think it's
best to get the Append case working first, or perhaps that's another
patch...
Please find attached my patch, which is based directly on top of
Amit's faster partition pruning v14 [1], which I patched against
4034db215b9
[1] https://www.postgresql.org/message-id/9b98fc47-34b8-0ab6-27fc-c8a0889f2e5b%40lab.ntt.co.jp
-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
| Attachment | Content-Type | Size | 
|---|---|---|
| runtime_prune_drowley_v1.patch | application/octet-stream | 27.1 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Amit Kapila | 2017-12-16 07:23:06 | Re: [sqlsmith] Parallel worker executor crash on master | 
| Previous Message | Jeremy Finzel | 2017-12-16 04:50:07 | Re: Backfill bgworker Extension? |