Re: Consider the number of columns in the sort cost model

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Kirill Reshke <reshkekirill(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Consider the number of columns in the sort cost model
Date: 2024-10-15 04:47:55
Message-ID: 4c093e3c-aa41-431e-9849-b79995cdf0e3@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/15/24 02:59, Kirill Reshke wrote:
> On Thu, 22 Aug 2024 at 23:46, Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
> Two questions:
Thank you for the interest to this feature!
> 1) Why do we change the `cost_sort` definition? We can calculate the
> length of `pathkeys` inside cost_sort if we want, just like we do
> inside cost_gather_merge.
I am suspicious of that but open to hearing other opinions. The
coefficient incorporates knowledge about how many comparisons will be
made with this sorting operator. The caller can have some additional
knowledge about that. For example, IncrementalSort knows about groups -
each group has the same prefix, which means each tuple comparison will
inevitably cause a comparison for each column from the prefix. Also, it
may be knowledge about the number of distinct values for each column,
etc. I'm not sure we should elaborate cost_sort a lot here.
>
> 2) I'm not very sure about the (1.0 + cmpMultiplier) coefficient. I
> understand that we need to multiply by something that is not 0 (and
> that's why we add 1), and this is strictly better than multiplying by
> 2, but why exactly is this formula like this, not (1.0 + cmpMultiplier
> ^ 2 / 10) for example? Maybe we need some more sophisticated approach?
Hmmm, I think you misunderstood the general idea behind this
coefficient. I explained it earlier this year in more detail [1],
showing the direction of future improvement.
Value '1' in this formula reflects some second-order significance
physics that we don't consider in that formula. Of course, it may be
different, but after tests, I decided to keep it simple.
The cmpMultiplier value reflects the number of comparisons the executor
will make when comparing a pair of tuples. It is a fraction of the
maximum number of potential comparisons for each operation of tuple
comparisons (likewise, Hash Agg already employs in costing). Right now,
we don't use distinct statistics on sorting columns and use a
conservative approach, which may be improved in the future. But I don't
think it is a real problem here because we use the same in all places
and, with this feature, change the balance only with hashing methods (in
favour of their usage). I think it is not harmful because this
characteristic is a second-order grade factor and shouldn't
significantly reduce the performance of actual queries.
>
> Also, this patch needs to be rebased, again.
Thanks, done.

[1] https://danolivo.substack.com/p/elaboration-of-the-postgresql-sort

--
regards, Andrei Lepikhov

Attachment Content-Type Size
v3-0001-Introduce-the-number-of-columns-in-the-cost-sort-.patch text/x-patch 19.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2024-10-15 05:15:13 Re: Consider the number of columns in the sort cost model
Previous Message Amul Sul 2024-10-15 04:33:38 Re: NOT ENFORCED constraint feature