Re: Incremental Sort Cost Estimation Instability

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: David Rowley <dgrowleyml(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Incremental Sort Cost Estimation Instability
Date: 2024-09-12 14:57:18
Message-ID: 9a5fb370-6c48-400d-a78e-e57fc0ff0379@vondra.me
Lists: pgsql-hackers

On 9/12/24 12:12, David Rowley wrote:
> On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
>> Initial problem causes wrong cost_sort estimation. Right now I think
>> about providing cost_sort() the sort clauses instead of (or in addition
>> to) the pathkeys.
>
> I'm not quite sure why the sort clauses matter any more than the
> EquivalenceClass. If the EquivalenceClass defines that all members
> will have the same value for any given row, then, if we had to choose
> any single member to drive the n_distinct estimate from, isn't the
> most accurate distinct estimate from the member with the smallest
> n_distinct estimate? (That assumes the less distinct member has every
> value the more distinct member has, which might not be true)
>
> David
>

How large can the cost difference get? My assumption was that it's correlated
with how different the ndistinct values are for the two sides, so I tried this:

CREATE TABLE t1 (x integer, y integer, z text);
CREATE TABLE t2 (x integer, y integer, z text);

INSERT INTO t1 (x,y) SELECT x, 1
FROM generate_series(1,1000000) AS x;
INSERT INTO t2 (x,y) SELECT mod(x,1000), 1
FROM generate_series(1,1000000) AS x;

CREATE INDEX ON t1(x);
CREATE INDEX ON t2(x);
CREATE INDEX ON t1(y);
CREATE INDEX ON t2(y);

VACUUM ANALYZE;
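
(As a quick sanity check on what ANALYZE came up with, pg_stats shows the
resulting ndistinct estimates directly - t1.x should come out as -1, i.e.
treated as unique, and t2.x as roughly 1000:)

-- pg_stats reports a negative n_distinct (a fraction of the row count)
-- for columns it considers unique, so expect -1 for t1.x, ~1000 for t2.x
SELECT tablename, attname, n_distinct
  FROM pg_stats
 WHERE tablename IN ('t1', 't2') AND attname = 'x';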

The mod(x, 1000) changes the ndistinct for t2.x from 1M to 1k. I expected the
cost difference to get much larger, but it does not really change:

GroupAggregate  (cost=38.99..37886.88 rows=992 width=16) (actual rows=1 loops=1)

GroupAggregate  (cost=37874.26..37904.04 rows=992 width=16) (actual rows=1 loops=1)

So the total cost difference is tiny - I'd even say it does not matter in
practice (it seems well within the possible impact of collecting a different
random sample).

But the startup cost changes in a rather absurd way, while the rest of the
plan is exactly the same. We even see this:

-> Incremental Sort (cost=38.99..37869.52 rows=992 width=8)
Sort Key: t1.x, t1.y
Presorted Key: t1.x

in both cases. There's literally no other difference between these plans
visible in the explain ...

I'm not sure how to fix this, but it seems estimate_num_groups() - which is
what cost_incremental_sort() uses to estimate the number of presorted
groups - needs to do things differently. And I agree that looking for the
minimum ndistinct seems like the right approach. But isn't
estimate_num_groups() supposed to do that already? The comment says:

* 3. If the list contains Vars of different relations that are known equal
* due to equivalence classes, then drop all but one of the Vars from each
* known-equal set, keeping the one with smallest estimated # of values
* (since the extra values of the others can't appear in joined rows).
* Note the reason we only consider Vars of different relations is that
* if we considered ones of the same rel, we'd be double-counting the
* restriction selectivity of the equality in the next step.

I haven't debugged this, but how come this doesn't do the trick?
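
(One way to poke at that step in isolation - just a sketch, I haven't
actually run it - would be to put both known-equal Vars into the grouping
list and see whether the estimate collapses to the smaller ndistinct:)

-- if step 3 above kicks in, t1.x and t2.x (same equivalence class thanks
-- to the join clause) should be collapsed to the member with the smaller
-- ndistinct, so the group estimate should be ~1k rather than ~1M
EXPLAIN
SELECT count(*)
  FROM t1 JOIN t2 ON t1.x = t2.x
 GROUP BY t1.x, t2.x;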

regards

--
Tomas Vondra
