Re: Postgres picks suboptimal index after building of an extended statistics

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Postgres picks suboptimal index after building of an extended statistics
Date: 2023-12-18 13:29:35
Message-ID: CAPpHfdt_AFo9SQkD2JooH3CTR9e+6fnEUZy_hGV5ZcEszHirCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

I'd like to get this subject off the ground. The problem originally
described in [1] obviously comes from wrong selectivity estimation.
"Dependencies" extended statistics lead to significant selectivity miss
24/1000 instead of 1/1000. When the estimation is correct, the PostgreSQL
optimizer is capable of choosing the appropriate unique index for the query.

Tom pointed out in [2] that this might be a problem of "Dependencies"
extended statistics. I've rechecked this. The dataset consists of two
parts. The first part defined in the CREATE TABLE statement contains
independent values. The second part defined in the INSERT statement
contains dependent values. "Dependencies" extended statistics estimate
dependency degree as a fraction of rows containing dependent values.
According to this definition, it correctly estimates the dependency degree
as about 0.5 for all the combinations. So, the error in the estimate comes
from the synergy of two factors: MCV estimation of the frequency of
individual values, and spreading of average dependency degree for those
values, which is not relevant for them. I don't think there is a way to
fix "dependencies" extended statistics because it works exactly as
designed. I have to note that instead of fixing "dependencies" extended
statistics you can just add multi-column MCV statistics, which would fix
the problem.

CREATE STATISTICS aestat(dependencies,ndistinct,mcv) ON x,y,z FROM a;

Independently on how well our statistics work, it looks pitiful that we
couldn't fix that using the knowledge of unique constraints on the table.
Despite statistics, which give us just more or less accurate estimates, the
constraint is something we really enforce and thus can rely on. The
patches provided by Andrei in [1], [3], and [4] fix this issue at the index
scan path level. As Thomas pointed out in [5], that could lead to
inconsistency between the number of rows used for unique index scan
estimation and the value displayed in EXPLAIN (and used for other paths).
Even though this was debated in [6], I can confirm this inconsistency.
Thus, with the patch published in [4], I can see the 28-row estimation with
a unique index scan.

` QUERY PLAN
-----------------------------------------------------------------------
Index Only Scan using a_pkey on a (cost=0.28..8.30 rows=28 width=12)
Index Cond: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)

Also, there is a set of patches [7], [8], and [9], which makes the
optimizer consider path selectivity as long as path costs during the path
selection. I've rechecked that none of these patches could resolve the
original problem described in [1]. Also, I think they are quite tricky.
The model of our optimizer assumes that paths in the list should be the
different ways of getting the same result. If we choose the paths by their
selectivity, that breaks this model. I don't say there is no way for
this. But if we do this, that would require significant rethinking of our
optimizer model and possible revision of a significant part of it. Anyway,
I think if there is still interest in this, that should be moved into a
separate thread to keep this thread focused on the problem described in [1].

Finally, I'd like to note that the issue described in [1] is mostly the
selectivity estimation problem. It could be solved by adding the
multi-column MCV statistics. The patches published so far look more like
hacks for particular use cases rather than appropriate solutions. It still
looks promising to me to use the knowledge of unique constraints during
selectivity estimation [10]. Even though it's hard to implement and
possibly implies some overhead, it fits the current model. I also think
unique contracts could probably be used in some way to improve estimates
even when there is no full match.

Links.
1.
https://www.postgresql.org/message-id/0ca4553c-1f34-12ba-9122-44199d1ced41%40postgrespro.ru
2. https://www.postgresql.org/message-id/3119052.1657231656%40sss.pgh.pa.us
3.
https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
4.
https://www.postgresql.org/message-id/a5a18d86-c0e5-0ceb-9a18-be1beb2d2944%40postgrespro.ru
5.
https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com
6.
https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
7.
https://www.postgresql.org/message-id/2df148b5-0bb8-f80b-ac03-251682fab585%40postgrespro.ru
8.
https://www.postgresql.org/message-id/6fb43191-2df3-4791-b307-be754e648276%40postgrespro.ru
9.
https://www.postgresql.org/message-id/154f786a-06a0-4fb1-b8a4-16c66149731b%40postgrespro.ru
10.
https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com

------
Regards,
Alexander Korotkov

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jelte Fennema-Nio 2023-12-18 13:30:12 Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
Previous Message Amit Kapila 2023-12-18 13:04:19 Re: "pgoutput" options missing on documentation