Re: POC: GROUP BY optimization

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Richard Guo <guofenglinux(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, David Rowley <dgrowleyml(at)gmail(dot)com>, "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
Subject: Re: POC: GROUP BY optimization
Date: 2024-05-27 12:41:25
Message-ID: CAPpHfdvbxK3nszdk6YXWpV7p+P+pGOQhpKTxJ3xME3uD+qZmXg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

On Tue, Apr 23, 2024 at 1:19 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> On Thu, Apr 18, 2024 at 1:57 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > Thank you for the fixes you've proposed. I didn't look much into
> > details yet, but I think the main concern Tom expressed in [1] is
> > whether the feature is reasonable at all. I think at this stage the
> > most important thing is to come up with convincing examples showing
> > how huge performance benefits it could cause. I will return to this
> > later today and will try to provide some convincing examples.
>
> I took a fresh look at 0452b461b, and have the following thoughts:
> 1) Previously, preprocess_groupclause() tries to reorder GROUP BY
> clauses to match the required ORDER BY order. It only reorders if
> GROUP BY pathkeys are the prefix of ORDER BY pathkeys or vice versa.
> So, both of them need to be satisfied by one ordering. 0452b461b also
> tries to match GROUP BY clauses to ORDER BY clauses, but takes into
> account an incremental sort. Actually, instead of removing
> preprocess_groupclause(), we could just teach it to take incremental
> sort into account.
> 2) The get_useful_group_keys_orderings() function takes into account 3
> orderings of pathkeys and clauses: original order as written in GROUP
> BY, matching ORDER BY clauses as much as possible, and matching the
> input path as much as possible. Given that even before 0452b461b,
> preprocess_groupclause() could change the original order of GROUP BY
> clauses, so we don't need to distinguish the first two. We could just
> consider output of new preprocess_groupclause() taking into account an
> incremental sort and the ordering matching the input path as much as
> possible. This seems like significant simplification.
>
> Let me also describe my thoughts about the justification of the
> feature itself. As Tom pointed upthread, Sort + Grouping is generally
> unlikely faster than Hash Aggregate. The main point of this feature
> is being able to match the order of GROUP BY clauses to the order of
> the input path. That input path could be Index Scan or Subquery.
> Let's concentrate on Index Scan. Index Scan can give us the required
> order, so we can do grouping without Sort or with significantly
> cheaper Incremental Sort. That could be much faster than Hash
> Aggregate. But if we scan the full table (or its significant
> fraction), Index Scan is typically much more expensive than Sequential
> Scan because of random page access. However, there are cases when
> Index Scan could be cheaper.
> 1) If the heap row is wide and the index contains all the required
> columns, Index Only Scan can be cheaper than Sequential Scan because
> of lesser volume.
> 2) If the query predicate is selective enough and matches the index,
> Index Scan might be significantly cheaper. One may say that if the
> query predicate is selective enough then there are not that many
> matching rows, so aggregating them either way isn't a big deal anyway.
> However, given that data volumes are growing tremendously, it's not
> hard to imagine that the index selected a small fraction of table
> rows, but they are still enough to be costly for aggregating.
> Therefore, I think there are significant cases where matching GROUP BY
> clauses to the order of the input path could give a substantial
> improvement over Hash Aggregate.
>
> While there are some particular use-cases by Jian He, I hope that
> above could give some rationale.

I've assembled patches in this thread into one patchset.
0001 The patch fixing asymmetry in setting EquivalenceClass.ec_sortref
by Andrei [1]. I've revised comments and wrote the commit message.
0002 The patch for handling duplicates of SortGroupClause. I didn't
get the sense of Andrei implementation. It seems to care about
duplicate pointers in group clauses list. But the question is the
equal SortGroupClause's comprising different pointers. I think we
should group duplicate SortGroupClause's together as
preprocess_groupclause() used to do. Reimplemented patch to do so.
0003 Rename PathKeyInfo to GroupByOrdering by Andres [3]. I only
revised comments and wrote the commit message.
0004 Turn back preprocess_groupclause() for the reason I described upthread [4].

Any thoughts?

Links.
1. https://www.postgresql.org/message-id/17037754-f187-4138-8285-0e2bfebd0dea%40postgrespro.ru
2. https://www.postgresql.org/message-id/a663f0f6-cbf6-49aa-af2e-234dc6768a07%40postgrespro.ru
3. https://www.postgresql.org/message-id/db0fc3a4-966c-4cec-a136-94024d39212d%40postgrespro.ru
4. https://www.postgresql.org/message-id/CAPpHfdvyWLMGwvxaf%3D7KAp-z-4mxbSH8ti2f6mNOQv5metZFzg%40mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v2-0003-Rename-PathKeyInfo-to-GroupByOrdering.patch application/octet-stream 7.0 KB
v2-0004-Restore-preprocess_groupclause.patch application/octet-stream 13.1 KB
v2-0002-Teach-group_keys_reorder_by_pathkeys-about-redund.patch application/octet-stream 4.2 KB
v2-0001-Fix-asymmetry-in-setting-EquivalenceClass.ec_sort.patch application/octet-stream 8.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-05-27 13:10:13 Re: relfilenode statistics
Previous Message Ranier Vilela 2024-05-27 12:25:45 Fix calloc check if oom (PQcancelCreate)