From: | Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru> |
---|---|
To: | Andrei Lepikhov <lepihov(at)gmail(dot)com> |
Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, tomas(at)vondra(dot)me |
Subject: | Re: Incremental Sort Cost Estimation Instability |
Date: | 2024-11-07 11:06:37 |
Message-ID: | 6291ee9c-a87d-43a9-b848-43a53cef7762@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi!
On 07.11.2024 08:57, Andrei Lepikhov wrote:
> On 10/8/24 11:33, Andrei Lepikhov wrote:
>> On 9/23/24 20:02, Andrei Lepikhov wrote:
>>> On 12/9/2024 12:12, David Rowley wrote:
>>>> On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepihov(at)gmail(dot)com>
>> Minor change to make compiler and cfbot happy
> Now, this thread looks connected to the [1]. However, it still has
> independent profit, which can be discussed separately.
> After the introduction of the em->em_ndistinct cache, I played around
> with the idea of letting the estimate_num_groups use this cache.
> Occasionally found out that we have one more instability case like
> the following:
>
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (x int, y int, z int);
> INSERT INTO test (x,y,z) (SELECT random()*1E5, random()*2, random()
> FROM generate_series(1,1e4));
> VACUUM ANALYZE test;
>
> EXPLAIN SELECT count(*) FROM test WHERE x=y GROUP BY x,y;
> EXPLAIN SELECT count(*) FROM test WHERE x=y GROUP BY y,x;
>
> Here, you can see that depending on the initial order of grouping,
> Postgres chooses different columns for grouping. Doing that causes
> different estimations - one of them is definitely wrong:
>
> GroupAggregate (cost=181.41..182.29 rows=50 width=16)
> GroupAggregate (cost=181.41..181.82 rows=3 width=16)
>
> That happens because when estimating the number of groups, Postgres
> doesn't consider EquivalenceClass, which can let him correct group
> estimation at a low price.
> It may be done inside the make_pathkeys_for_sortclauses_extended by
> choosing a column with a lower number of distinct, but IMO, it is
> better to do it at the moment of the number of groups estimation.
>
> Thoughts? Is it a real issue or just a non-practical corner case?
>
> The new version of the patch is attached.
>
> [1]
> https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com
But you haven’t considered the case when you need to use non-cached
values, for example, if ndistinct has already changed. Look, here x has
a minimum ndistinct, and then column z:
alena(at)postgres=# delete from test;
DELETE 10000
alena(at)postgres=# INSERT INTO test (x,y,z) (SELECT *id%3*, id*2, id FROM
generate_series(1,1e4) as id);
INSERT 0 10000
alena(at)postgres=# EXPLAIN SELECT * FROM test where x=y ORDER BY x,y,z;
QUERY PLAN
--------------------------------------------------------------
Sort (cost=196.88..197.02 rows=56 width=12)
*Sort Key: x, z*
-> Seq Scan on test (cost=0.00..195.25 rows=56 width=12)
Filter: (x = y)
(4 rows)
alena(at)postgres=# delete from test;
DELETE 10000
alena(at)postgres=# INSERT INTO test (x,y,z) (SELECT id, id*2, *id%3* FROM
generate_series(1,1e4) as id);
INSERT 0 10000
alena(at)postgres=# vacuum analyze;
VACUUM
alena(at)postgres=# EXPLAIN SELECT * FROM test where x=y ORDER BY x,y,z;
QUERY PLAN
--------------------------------------------------------------
Sort (cost=235.41..235.54 rows=50 width=12)
*Sort Key: x, z*
-> Seq Scan on test (cost=0.00..234.00 rows=50 width=12)
Filter: (x = y)
(4 rows)
but the order of the columns does not change, as you can see.
--
Regards,
Alena Rybakina
Postgres Professional
From | Date | Subject | |
---|---|---|---|
Next Message | Tender Wang | 2024-11-07 11:07:35 | Unsafe access BufferDescriptors array in BufferGetLSNAtomic() |
Previous Message | Masahiro Ikeda | 2024-11-07 10:44:24 | Re: Avoiding superfluous buffer locking during nbtree backwards scans |