From: Tomas Vondra <tomas(at)vondra(dot)me>
To: David Rowley <dgrowleyml(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Incremental Sort Cost Estimation Instability
Date: 2024-09-12 14:57:18
Message-ID: 9a5fb370-6c48-400d-a78e-e57fc0ff0379@vondra.me
Lists: pgsql-hackers
On 9/12/24 12:12, David Rowley wrote:
> On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
>> Initial problem causes wrong cost_sort estimation. Right now I think
>> about providing cost_sort() the sort clauses instead of (or in addition
>> to) the pathkeys.
>
> I'm not quite sure why the sort clauses matter any more than the
> EquivalenceClass. If the EquivalenceClass defines that all members
> will have the same value for any given row, then, if we had to choose
> any single member to drive the n_distinct estimate from, isn't the
> most accurate distinct estimate from the member with the smallest
> n_distinct estimate? (That assumes the less distinct member has every
> value the more distinct member has, which might not be true)
>
> David
>
How large can the cost difference get? My assumption was that it's
correlated with how different the ndistincts are for the two sides, so
I tried this:
CREATE TABLE t1(x integer, y integer, z text);
CREATE TABLE t2(x integer, y integer, z text);
INSERT INTO t1 (x,y) SELECT x, 1
FROM generate_series(1,1000000) AS x;
INSERT INTO t2 (x,y) SELECT mod(x,1000), 1
FROM generate_series(1,1000000) AS x;
CREATE INDEX ON t1(x);
CREATE INDEX ON t2(x);
CREATE INDEX ON t1(y);
CREATE INDEX ON t2(y);
VACUUM ANALYZE;
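As a quick sanity check (not part of the original recipe, and the
expected values are my assumption about what ANALYZE produces here),
the resulting statistics can be inspected via pg_stats:

SELECT tablename, attname, n_distinct
  FROM pg_stats
 WHERE tablename IN ('t1', 't2') AND attname = 'x';
-- expectation: t1.x reported as -1 (all values distinct),
-- t2.x as roughly 1000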
This changes the ndistinct for t2.x from 1M to 1k. I expected the cost
difference to get much larger, but it does not really change:
GroupAggregate (cost=38.99..37886.88 rows=992 width=16) (actual rows=1 loops=1)
GroupAggregate (cost=37874.26..37904.04 rows=992 width=16) (actual rows=1 loops=1)
The total cost difference is tiny - I'd even say it does not matter in
practice (it seems well within the possible impact of collecting a
different random sample). But the startup cost changes in a rather
absurd way, while the rest of the plan is exactly the same. We even
know this:
-> Incremental Sort (cost=38.99..37869.52 rows=992 width=8)
Sort Key: t1.x, t1.y
Presorted Key: t1.x
in both cases. There's literally no other difference between these plans
visible in the explain ...
I'm not sure how to fix this, but it seems estimate_num_groups() needs
to do things differently. And I agree that looking for the minimum
ndistinct seems like the right approach. But isn't estimate_num_groups()
supposed to already do that? The comment says:
* 3. If the list contains Vars of different relations that are known equal
* due to equivalence classes, then drop all but one of the Vars from each
* known-equal set, keeping the one with smallest estimated # of values
* (since the extra values of the others can't appear in joined rows).
* Note the reason we only consider Vars of different relations is that
* if we considered ones of the same rel, we'd be double-counting the
* restriction selectivity of the equality in the next step.
I haven't debugged this, but how come this doesn't do the trick?
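One way to poke at point 3 directly from SQL (a sketch using the tables
from the setup above; the expected estimate is my assumption about how
the clamping should behave):

EXPLAIN SELECT 1
  FROM t1 JOIN t2 ON t1.x = t2.x
 GROUP BY t1.x, t2.x;
-- the grouping list has two Vars known equal via the equivalence
-- class; per point 3, the estimate should keep only the one with
-- the smaller ndistinct (t2.x, ~1000), not t1.x (~1M)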
regards
--
Tomas Vondra