Re: Consider the number of columns in the sort cost model

From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Consider the number of columns in the sort cost model
Date: 2024-10-14 19:59:11
Message-ID: CALdSSPjkwPimt7SKs6J17TJFBtu7p1XfuPSeBQYL1QNdxbFS9w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

On Thu, 22 Aug 2024 at 23:46, Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
>
> Hi,
>
> I would like to propose a slight elaboration of the sort cost model.
> In practice, we frequently see the choice of suboptimal sortings, which
> slump performance by 10-50%.
>
> The key reason here is the optimiser's blindness to the fact that
> sorting calls a comparison operator for each pair of attributes in the
> tuple. Just for example:
>
> SET work_mem = '128MB';
> CREATE TABLE test (
> x1 int,x2 int,x3 int,x4 int,x5 int,x6 int,x7 int,x8 int,payload text);
> INSERT INTO test (x1,x2,x3,x4,x5,x6,x7,x8,payload) (
> SELECT 1,2,3,4,5,6,7,(1E5-x), 'abc'||x
> FROM generate_series(1,1E5) AS x);
> VACUUM ANALYZE test;
>
> Let's execute the following three queries:
>
> EXPLAIN (ANALYZE)
> SELECT * FROM test WHERE x1<9E4 ORDER BY x8;
> EXPLAIN (ANALYZE)
> SELECT * FROM test WHERE x1<9E4 ORDER BY x1,x2,x3,x8;
> EXPLAIN (ANALYZE)
> SELECT * FROM test WHERE x1<9E4 ORDER BY x1,x2,x3,x4,x5,x6,x7,x8;
>
> My laptop executes these three queries in 100ms, 300ms, and 560ms,
> respectively. At the same time, the costs of these query plans are
> similar. This corner case shows that sorting matters and in the corner
> case, dependence on the number of comparisons looks close to linear.

Indeed. My numbers are 44.167, 88.759,136.864 ms:

> The patch in the attachment is a simplistic attempt to differentiate
> these sortings by the number of columns. It is not ideal because
> comparisons depend on the duplicates in each column, but it may be the
> subject of further work.
>
> Even such a trivial model allows us to resolve the case like the below:
> CREATE INDEX ON test (x1,x2,x3,x8);
> EXPLAIN (ANALYZE) SELECT * FROM test WHERE x1<9E4 ORDER BY x1,x2,x3,x8;
>
> The current master will choose SeqScan for four columns as well as for a
> single column. With the model, the optimiser will understand that
> sorting four columns costs more than sorting a single column and switch
> to IndexScan.
>
> --
> regards, Andrei Lepikhov

I configm, before applying your patch Index Scan is not chosen,
whereas after it does.

Two questions:
1) Why do we change the `cost_sort` definition? We can calculate the
length of `patchkeys` inside cost_sort if we want, just like we do
inside cost_gather_merge. No need to make extension developers suffer
with API change, isn't it?

2) I'm not very sure about the (1.0 + cmpMultiplier) coefficient. I
understand that we need to multiply by something that is not 0 (and
that's why we add 1), and this is strictly better than multiplying by
2, but why exactly is this formula like this, not (1.0 + cmpMultiplier
^ 2 / 10) for example? Maybe we need some more sophisticated approach?

Also, this patch needs to be rebased, again.

--
Best regards,
Kirill Reshke

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-10-14 20:02:22 Re: Large expressions in indexes can't be stored (non-TOASTable)
Previous Message Joel Jacobson 2024-10-14 19:59:01 Re: New "raw" COPY format