Re: 回复: An implementation of multi-key sort

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: Yao Wang <yao-yw(dot)wang(at)broadcom(dot)com>, John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Wang Yao <yaowangm(at)outlook(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Hongxu Ma <hongxu(dot)ma(at)broadcom(dot)com>
Subject: Re: 回复: An implementation of multi-key sort
Date: 2024-07-07 06:32:34
Message-ID: a3093ea7-ef66-4ae5-a041-273b73b05e80@garret.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 04/07/2024 3:45 pm, Yao Wang wrote:
> Generally, the benefit of mksort is mainly from duplicated values and sort
> keys: the more duplicated values and sort keys are, the bigger benefit it
> gets.
...
> 1. Use distinct stats info of table to enable mksort
>
> It's kind of heuristics: in optimizer, check Form_pg_statistic->stadistinct
> of a table via pg_statistics. Enable mksort only when it is less than a
> threshold.
>
> The hacked code works, which need to modify a couple of interfaces of
> optimizer. In addition, a complete solution should consider types and
> distinct values of all columns, which might be too complex, and the benefit
> seems not so big.

If mksort really provides advantage only when there are a lot of
duplicates (for prefix keys?) and of small fraction of duplicates there
is even some (small) regression
then IMHO taking in account in planner information about estimated
number of distinct values seems to be really important. What was a
problem with accessing this statistics and why it requires modification
of optimizer interfaces? There is `get_variable_numdistinct` function
which is defined and used only in selfuncs.c

Information about values distribution seems to be quite useful for 
choosing optimal sort algorithm. Not only for multi-key sort
optimization. For example if we know min.max value of sort key and it is
small, we can use O(N) algorithm for sorting. Also it can help to
estimate when TOP-N search is preferable.

Right now Posgres creates special path for incremental sort. I am not
sure if we also need to be separate path for mk-sort.
But IMHO if we need to change some optimizer interfaces to be able to
take in account statistic and choose preferred sort algorithm at
planning time, then it should be done.
If mksort can increase sort more than two times (for large number of
duplicates), it will be nice to take it in account when choosing optimal
plan.

Also in this case we do not need extra GUC for explicit enabling of
mksort. There are too many parameters for optimizer and adding one more
will make tuning more complex. So I prefer that decision is take buy
optimizer itself based on the available information, especially if
criteria seems to be obvious.

Best regards,
Konstantin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2024-07-07 06:40:46 dropping privileges on windows vs privileged accounts
Previous Message Andres Freund 2024-07-07 06:07:27 Re: tests fail on windows with default git settings