Re: Consider the number of columns in the sort cost model

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Consider the number of columns in the sort cost model
Date: 2024-10-15 05:15:13
Message-ID: CAApHDvpLKFoi-Yk__-3wWfUdd+KEnuREbn5ApiO+hET6WsVdaw@mail.gmail.com
Lists: pgsql-hackers

On Tue, 15 Oct 2024 at 17:48, Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
> I am suspicious of that but open to hearing other opinions. The
> coefficient incorporates knowledge about how many comparisons will be
> made with this sorting operator. The caller can have some additional
> knowledge about that. For example, IncrementalSort knows about groups -
> each group has the same prefix, which means each tuple comparison will
> inevitably cause a comparison for each column from the prefix. There
> may also be knowledge about the number of distinct values for each
> column, etc. I'm not sure we should make cost_sort much more
> elaborate here.

I agree that cost_sort() likely needs a bit of work to bring it
closer to the current reality. There have been plenty of performance
improvements to sorting over the years, and I don't recall the
costing code ever being touched; that's likely because there are just
so many things that are either not costed correctly or not costed at
all.

As for your patch, I'm doubtful that the general idea you're
proposing is an actual improvement. I've only glanced at the patch,
but at least from this fragment here:

@@ -2150,9 +2156,12 @@ cost_sort(Path *path, PlannerInfo *root,
 {
     Cost        startup_cost;
     Cost        run_cost;
+    int         cmpMultiplier = npathkeys;

and:

 static void
 cost_tuplesort(Cost *startup_cost, Cost *run_cost,
-               double tuples, int width,
+               double tuples, int width, int cmpMultiplier,
                Cost comparison_cost, int sort_mem,
                double limit_tuples)
 {
@@ -1913,7 +1917,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
     tuples = 2.0;

     /* Include the default cost-per-comparison */
-    comparison_cost += 2.0 * cpu_operator_cost;
+    comparison_cost += (1.0 + cmpMultiplier) * cpu_operator_cost;

it seems you're just charging cpu_operator_cost * <number of columns
to sort by>. It seems like it won't be very hard to fool that into
doing the wrong thing when the first column to sort by is unique or
nearly unique, since there will be far fewer tiebreaker comparisons,
or none at all, in that case.
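
To put rough numbers on that (using the default cpu_operator_cost of
0.0025; the three-column sort is just my example):

    before the patch:          2.0 * 0.0025        = 0.0050 per comparison
    patched, 3 sort columns:   (1.0 + 3) * 0.0025  = 0.0100 per comparison

If the first of those three columns is unique, nearly every
comparison is decided on that column alone, so the true cost is much
closer to the old figure than to the doubled one.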

As mentioned by Kirill, I also don't understand the cost_sort()
signature change. Why would you do that over just doing
list_length(pathkeys) within cost_sort()? You're throwing away a
parameter that might be the most useful one of the bunch for allowing
better sort cost estimates.
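
i.e. just something like this inside cost_sort(), keeping the
existing pathkeys parameter (a sketch):

    /* derive the multiplier locally rather than changing the signature */
    int         cmpMultiplier = list_length(pathkeys);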

I think what might be worth working on is seeing whether you can
better estimate the number of tiebreak comparisons by checking the
number of distinct values for each sort key. That might require what
we discussed on another thread about having estimate_num_groups()
make better use of the n_distinct estimate from the EquivalenceMember
with the fewest distinct values. Whether all of that can be done
while still performing well enough for cost_sort() is another
question. You may have to write it first before we can figure that
out. It may be possible to buy back some of the overhead with some
additional caching... perhaps that could be done within the
EquivalenceClass.
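
Roughly, I imagine something along these lines (just a sketch, not a
proposal for actual API; the per-column independence and uniformity
assumptions are mine, and the ndistinct array is hypothetical):

/*
 * Estimate the expected number of column comparisons performed per
 * tuple-pair comparison, given an n_distinct estimate per sort key.
 * Assumes column values are independent and uniformly distributed.
 */
static double
expected_columns_compared(const double *ndistinct, int nkeys)
{
    double      expected = 0.0;
    double      tie_prob = 1.0;     /* P(all earlier columns tied) */

    for (int i = 0; i < nkeys; i++)
    {
        /* column i is compared only when every earlier column tied */
        expected += tie_prob;

        /* under uniformity, P(two random values tie) ~= 1/ndistinct */
        if (ndistinct[i] > 1.0)
            tie_prob /= ndistinct[i];
    }

    return expected;
}

For three keys with ndistinct = {1000, 10, 2} that gives
1 + 1/1000 + 1/10000 ~= 1.001 columns per comparison, versus the flat
cmpMultiplier = 3 your patch would charge for.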

David
