From: | Richard Guo <riguo(at)pivotal(dot)io> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Parallel grouping sets |
Date: | 2019-07-31 08:06:30 |
Message-ID: | CAN_9JTyPnkHuZ8ruWCwW-QiCefntONaE+E0w7H63hw2ywyh-4w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
> On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
> >On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <riguo(at)pivotal(dot)io> wrote:
> >
> >> Hi all,
> >>
> >> Paul and I have been hacking recently to implement parallel grouping
> >> sets, and here we have two implementations.
> >>
> >> Implementation 1
> >> ================
> >>
> >> Attached is the patch and also there is a github branch [1] for this
> >> work.
> >>
> >
> >Rebased with the latest master.
> >
>
> Hi Richard,
>
> thanks for the rebased patch. I think the patch is mostly fine (at least I
> don't see any serious issues). A couple minor comments:
>
Hi Tomas,
Thank you for reviewing this patch.
>
> 1) I think get_number_of_groups() would deserve a short explanation why
> it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
> same branch. Before these cases were clearly separated, now it seems a bit
> mixed up and it may not be immediately obvious why it's OK.
>
Added a short comment in get_number_of_groups() explaining the behavior
when doing partial aggregation for grouping sets.
>
> 2) There are new regression tests, but they are not added to any schedule
> (parallel or serial), and so are not executed as part of "make check". I
> suppose this is a mistake.
>
Yes, thanks. Added the new regression test in parallel_schedule and
serial_schedule.
>
> 3) The regression tests do check plan and results like this:
>
> EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
> SELECT ... ORDER BY 1, 2, 3;
>
> which however means that the query might easily use a different plan than
> what's verified in the eplain (thanks to the additional ORDER BY clause).
> So I think this should explain and execute the same query.
>
> (In this case the plans seems to be the same, but that may easily change
> in the future, and we could miss it here, failing to verify the results.)
>
Thank you for pointing this out. Fixed it in V4 patch.
>
> 4) It might be a good idea to check the negative case too, i.e. a query on
> data set that we should not parallelize (because the number of partial
> groups would be too high).
>
Yes, agree. Added a negative case.
>
>
> Do you have any plans to hack on the second approach too? AFAICS those two
> approaches are complementary (address different data sets / queries), and
> it would be nice to have both. One of the things I've been wondering is if
> we need to invent gset_id as a new concept, or if we could simply use the
> existing GROUPING() function - that uniquely identifies the grouping set.
>
>
Yes, I'm planning to hack on the second approach in short future. I'm
also reconsidering the gset_id stuff since it brings a lot of complexity
for the second approach. I agree with you that we can try GROUPING()
function to see if it can replace gset_id.
Thanks
Richard
Attachment | Content-Type | Size |
---|---|---|
v4-0001-Implementing-parallel-grouping-sets.patch | application/octet-stream | 24.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Langote | 2019-07-31 08:17:09 | Re: Problem with default partition pruning |
Previous Message | Amit Langote | 2019-07-31 08:04:38 | Re: partition routing layering in nodeModifyTable.c |