Quick Links

Re: POC: GROUP BY optimization

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Richard Guo <guofenglinux(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, David Rowley <dgrowleyml(at)gmail(dot)com>, "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
Subject:	Re: POC: GROUP BY optimization
Date:	2024-05-27 12:41:25
Message-ID:	CAPpHfdvbxK3nszdk6YXWpV7p+P+pGOQhpKTxJ3xME3uD+qZmXg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi!

On Tue, Apr 23, 2024 at 1:19 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> On Thu, Apr 18, 2024 at 1:57 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > Thank you for the fixes you've proposed. I didn't look much into
> > details yet, but I think the main concern Tom expressed in [1] is
> > whether the feature is reasonable at all. I think at this stage the
> > most important thing is to come up with convincing examples showing
> > how huge performance benefits it could cause. I will return to this
> > later today and will try to provide some convincing examples.
>
> I took a fresh look at 0452b461b, and have the following thoughts:
> 1) Previously, preprocess_groupclause() tries to reorder GROUP BY
> clauses to match the required ORDER BY order. It only reorders if
> GROUP BY pathkeys are the prefix of ORDER BY pathkeys or vice versa.
> So, both of them need to be satisfied by one ordering. 0452b461b also
> tries to match GROUP BY clauses to ORDER BY clauses, but takes into
> account an incremental sort. Actually, instead of removing
> preprocess_groupclause(), we could just teach it to take incremental
> sort into account.
> 2) The get_useful_group_keys_orderings() function takes into account 3
> orderings of pathkeys and clauses: original order as written in GROUP
> BY, matching ORDER BY clauses as much as possible, and matching the
> input path as much as possible. Given that even before 0452b461b,
> preprocess_groupclause() could change the original order of GROUP BY
> clauses, so we don't need to distinguish the first two. We could just
> consider output of new preprocess_groupclause() taking into account an
> incremental sort and the ordering matching the input path as much as
> possible. This seems like significant simplification.
>
> Let me also describe my thoughts about the justification of the
> feature itself. As Tom pointed upthread, Sort + Grouping is generally
> unlikely faster than Hash Aggregate. The main point of this feature
> is being able to match the order of GROUP BY clauses to the order of
> the input path. That input path could be Index Scan or Subquery.
> Let's concentrate on Index Scan. Index Scan can give us the required
> order, so we can do grouping without Sort or with significantly
> cheaper Incremental Sort. That could be much faster than Hash
> Aggregate. But if we scan the full table (or its significant
> fraction), Index Scan is typically much more expensive than Sequential
> Scan because of random page access. However, there are cases when
> Index Scan could be cheaper.
> 1) If the heap row is wide and the index contains all the required
> columns, Index Only Scan can be cheaper than Sequential Scan because
> of lesser volume.
> 2) If the query predicate is selective enough and matches the index,
> Index Scan might be significantly cheaper. One may say that if the
> query predicate is selective enough then there are not that many
> matching rows, so aggregating them either way isn't a big deal anyway.
> However, given that data volumes are growing tremendously, it's not
> hard to imagine that the index selected a small fraction of table
> rows, but they are still enough to be costly for aggregating.
> Therefore, I think there are significant cases where matching GROUP BY
> clauses to the order of the input path could give a substantial
> improvement over Hash Aggregate.
>
> While there are some particular use-cases by Jian He, I hope that
> above could give some rationale.

I've assembled patches in this thread into one patchset.
0001 The patch fixing asymmetry in setting EquivalenceClass.ec_sortref
by Andrei [1]. I've revised comments and wrote the commit message.
0002 The patch for handling duplicates of SortGroupClause. I didn't
get the sense of Andrei implementation. It seems to care about
duplicate pointers in group clauses list. But the question is the
equal SortGroupClause's comprising different pointers. I think we
should group duplicate SortGroupClause's together as
preprocess_groupclause() used to do. Reimplemented patch to do so.
0003 Rename PathKeyInfo to GroupByOrdering by Andres [3]. I only
revised comments and wrote the commit message.
0004 Turn back preprocess_groupclause() for the reason I described upthread [4].

Any thoughts?

Links.
1. https://www.postgresql.org/message-id/17037754-f187-4138-8285-0e2bfebd0dea%40postgrespro.ru
2. https://www.postgresql.org/message-id/a663f0f6-cbf6-49aa-af2e-234dc6768a07%40postgrespro.ru
3. https://www.postgresql.org/message-id/db0fc3a4-966c-4cec-a136-94024d39212d%40postgrespro.ru
4. https://www.postgresql.org/message-id/CAPpHfdvyWLMGwvxaf%3D7KAp-z-4mxbSH8ti2f6mNOQv5metZFzg%40mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

Attachment	Content-Type	Size
v2-0003-Rename-PathKeyInfo-to-GroupByOrdering.patch	application/octet-stream	7.0 KB
v2-0004-Restore-preprocess_groupclause.patch	application/octet-stream	13.1 KB
v2-0002-Teach-group_keys_reorder_by_pathkeys-about-redund.patch	application/octet-stream	4.2 KB
v2-0001-Fix-asymmetry-in-setting-EquivalenceClass.ec_sort.patch	application/octet-stream	8.8 KB

In response to

Re: POC: GROUP BY optimization at 2024-04-22 22:19:59 from Alexander Korotkov

Responses

About 0001:,Having overviewed it, I don't see any issues (but I'm the author), except grammatical ones - but I'm not a native to judge it.,Also, the sentence 'turning GROUP BY clauses into pathkeys' is unclear to me. It may be better to write something like: 'building pathkeys by the list of grouping clauses'.,,0002:,The part under USE_ASSERT_CHECKING looks good to me. But the code in group_keys_reorder_by_pathkeys looks suspicious: of course, we do some doubtful work without any possible way to reproduce, but if we envision some duplicated elements in the group_clauses, we should avoid usage of the list_concat_unique_ptr. What's more, why do you not exit from foreach_ptr immediately after SortGroupClause has been found? I think the new_group_clauses should be consistent with the new_group_pathkeys.,,0003:,Looks good,,0004:,I was also thinking about reintroducing the preprocess_groupclause because with the re-arrangement of GROUP-BY clauses according to incoming pathkeys, it doesn't make sense to have a user-defined order—at least while cost_sort doesn't differ costs for alternative column orderings.,So, I'm okay with the code. But why don't you use the same approach with foreach_ptr as before? at 2024-05-29 08:08:26 from Andrei Lepikhov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2024-05-27 13:10:13	Re: relfilenode statistics
Previous Message	Ranier Vilela	2024-05-27 12:25:45	Fix calloc check if oom (PQcancelCreate)