Re: Why don't we consider explicit Incremental Sort?

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why don't we consider explicit Incremental Sort?
Date: 2024-09-22 05:38:23
Message-ID: CAApHDvoSD=15=ec_ciZ6vX5cJzvn5hOBnmckTDZDkW55z2n3AQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, 20 Sept 2024 at 20:48, Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> I agree with you that sometimes the definition of 'regression' can
> depend on when the alternative plans are introduced. Imagine if we
> initially did not have the 1.5x pessimism factor and introduced it
> later, it would be treated as a 'regression' because some queries that
> could benefit from incremental sort would then have to resort to full
> sort.

I think this is a good way of looking at it. It's a bad idea, when
introducing new planner/executor abilities, to penalise the costs for
that ability. It might make the committer sleep better at night after
committing some new feature, but it's just not a future-proof
endeavour. We should aim for our cost models to be as close to the
truth as possible. As soon as you start introducing "just in case"
penalties, you're telling lies. Lies don't work out well, especially
when compounded with other lies, which is exactly what you get when
layering Paths atop Paths with just-in-case penalties added at each
level.

I was in this position with Memoize. The problem there is that when we
don't have any stats, we assume the n_distinct is 200. Memoize can
look quite attractive with such a small n_distinct estimate. I
invented SELFLAG_USED_DEFAULT to allow Memoize to steer clear when the
n_distinct estimate used the hard-coded default. I think that's an ok
thing to do, as otherwise it could work out very badly if Nested Loop
-> Memoize were used instead of, say, a Hash Join on a join problem
with millions of rows, all of them with distinct join values.
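
For anyone not familiar with that code, here's roughly what the idea
looks like in the Memoize path-generation code (a simplified sketch
rather than the exact source; the variable names are approximate):

    EstimationInfo estinfo;
    double      ndistinct;

    /*
     * estimate_num_groups() reports, via the EstimationInfo output,
     * whether any of its inputs fell back on the hard-coded
     * DEFAULT_NUM_DISTINCT (200) because no statistics were available.
     */
    ndistinct = estimate_num_groups(root, param_exprs, calls, NULL,
                                    &estinfo);

    /*
     * If the estimate came from the default rather than real stats,
     * don't build a Memoize path at all; it's too risky to trust that
     * guess.
     */
    if ((estinfo.flags & SELFLAG_USED_DEFAULT) != 0)
        return NULL;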

> So here is the v2 patch, which is almost the same as v1, but includes
> changes in test cases and comments, and also includes a commit
> message. I'll put it in the commitfest.

Just looking at the commit message:

> The rationale is based on the assumption that incremental sort is
> always faster than full sort when there are presorted keys, a premise
> that has been applied in various parts of the code. This assumption
> does not always hold, particularly in cases with a large skew in the
> number of rows within the presorted groups.

My understanding is that the worst case for incremental sort is the
same as a full sort, i.e. only 1 presorted group, which takes the same
effort to sort. Is there something I'm missing?

David
