From: | Kirill Reshke <reshkekirill(at)gmail(dot)com> |
---|---|
To: | Andrei Lepikhov <lepihov(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Consider the number of columns in the sort cost model |
Date: | 2024-10-14 19:59:11 |
Message-ID: | CALdSSPjkwPimt7SKs6J17TJFBtu7p1XfuPSeBQYL1QNdxbFS9w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi!
On Thu, 22 Aug 2024 at 23:46, Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
>
> Hi,
>
> I would like to propose a slight elaboration of the sort cost model.
> In practice, we frequently see the choice of suboptimal sortings, which
> slump performance by 10-50%.
>
> The key reason here is the optimiser's blindness to the fact that
> sorting calls a comparison operator for each pair of attributes in the
> tuple. Just for example:
>
> SET work_mem = '128MB';
> CREATE TABLE test (
> x1 int,x2 int,x3 int,x4 int,x5 int,x6 int,x7 int,x8 int,payload text);
> INSERT INTO test (x1,x2,x3,x4,x5,x6,x7,x8,payload) (
> SELECT 1,2,3,4,5,6,7,(1E5-x), 'abc'||x
> FROM generate_series(1,1E5) AS x);
> VACUUM ANALYZE test;
>
> Let's execute the following three queries:
>
> EXPLAIN (ANALYZE)
> SELECT * FROM test WHERE x1<9E4 ORDER BY x8;
> EXPLAIN (ANALYZE)
> SELECT * FROM test WHERE x1<9E4 ORDER BY x1,x2,x3,x8;
> EXPLAIN (ANALYZE)
> SELECT * FROM test WHERE x1<9E4 ORDER BY x1,x2,x3,x4,x5,x6,x7,x8;
>
> My laptop executes these three queries in 100ms, 300ms, and 560ms,
> respectively. At the same time, the costs of these query plans are
> similar. This corner case shows that sorting matters and in the corner
> case, dependence on the number of comparisons looks close to linear.
Indeed. My numbers are 44.167, 88.759,136.864 ms:
> The patch in the attachment is a simplistic attempt to differentiate
> these sortings by the number of columns. It is not ideal because
> comparisons depend on the duplicates in each column, but it may be the
> subject of further work.
>
> Even such a trivial model allows us to resolve the case like the below:
> CREATE INDEX ON test (x1,x2,x3,x8);
> EXPLAIN (ANALYZE) SELECT * FROM test WHERE x1<9E4 ORDER BY x1,x2,x3,x8;
>
> The current master will choose SeqScan for four columns as well as for a
> single column. With the model, the optimiser will understand that
> sorting four columns costs more than sorting a single column and switch
> to IndexScan.
>
> --
> regards, Andrei Lepikhov
I configm, before applying your patch Index Scan is not chosen,
whereas after it does.
Two questions:
1) Why do we change the `cost_sort` definition? We can calculate the
length of `patchkeys` inside cost_sort if we want, just like we do
inside cost_gather_merge. No need to make extension developers suffer
with API change, isn't it?
2) I'm not very sure about the (1.0 + cmpMultiplier) coefficient. I
understand that we need to multiply by something that is not 0 (and
that's why we add 1), and this is strictly better than multiplying by
2, but why exactly is this formula like this, not (1.0 + cmpMultiplier
^ 2 / 10) for example? Maybe we need some more sophisticated approach?
Also, this patch needs to be rebased, again.
--
Best regards,
Kirill Reshke
From | Date | Subject | |
---|---|---|---|
Next Message | Nathan Bossart | 2024-10-14 20:02:22 | Re: Large expressions in indexes can't be stored (non-TOASTable) |
Previous Message | Joel Jacobson | 2024-10-14 19:59:01 | Re: New "raw" COPY format |