Re: Adjust tuples estimate for appendrels

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tender Wang <tndrwang(at)gmail(dot)com>
Subject: Re: Adjust tuples estimate for appendrels
Date: 2025-01-27 17:52:34
Message-ID: cd6ef6c8-c99d-4772-bc27-7cbd3b4fd9c6@postgrespro.ru
Lists: pgsql-hackers

Hi!

On 27.01.2025 11:05, Richard Guo wrote:
> (This was briefly discussed in [1], which primarily focuses on the
> incremental sort regression. So start a new thread for this topic.)
>
> In set_append_rel_size(), we currently set rel->tuples to rel->rows
> for an appendrel. Generally, rel->tuples is the raw number of tuples
> in the relation and rel->rows is the estimated number of tuples after
> the relation's restriction clauses have been applied. Although an
> appendrel itself doesn't directly enforce any quals today, its child
> relations may. Therefore, setting rel->tuples equal to rel->rows for
> an appendrel isn't always appropriate.
>
> AFAICS, doing so can lead to issues in cost estimates. For instance,
> when estimating the number of distinct values from an appendrel, we
> would not be able to adjust the estimate based on the restriction
> selectivity (see estimate_num_groups()).
>
> Attached is a patch that sets an appendrel's tuples to the total
> number of tuples accumulated from each live child, which I believe
> aligns better with reality.
>
> Here's a simple example that demonstrates how this change improves
> cost estimates in certain cases.
>
> create table p (a int, b int, c float) partition by range(a);
> create table p1 partition of p for values from (0) to (1000);
> create table p2 partition of p for values from (1000) to (2000);
>
> insert into p select i%2000, random(1, 100000), random(1, 100000) from
> generate_series(1, 1000000)i;
>
> analyze p;
>
> explain analyze select b from p where c < 10000 group by b;
>
> -- without this patch
> HashAggregate (cost=18651.38..19568.54 rows=91716 width=4)
> (actual time=467.859..487.227 rows=63346 loops=1)
>
> -- with this patch
> HashAggregate (cost=18651.38..19275.60 rows=62422 width=4)
> (actual time=447.383..466.351 rows=63346 loops=1)
I looked at it and agree with your solution.
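
As an illustration of what the accumulated number corresponds to (this
is not part of the patch, just a way to see it from SQL): after ANALYZE,
the raw per-child tuple counts are visible in pg_class.reltuples, and
the patched appendrel's rel->tuples should effectively be their sum over
the live children, e.g. for the example above:

-- raw per-child tuple counts collected by ANALYZE
select relname, reltuples from pg_class where relname in ('p1', 'p2');
-- approximately what the patched appendrel's rel->tuples becomes
select sum(reltuples) from pg_class where relname in ('p1', 'p2');

rel->rows, in contrast, stays the post-filter estimate that EXPLAIN
reports for the Append.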
>
> Unfortunately, I failed to come up with a stable test case that shows
> a plan diff with this change. So the attached patch does not include
> a test case for now.

I created a stable test:

create table p (a int, b int, c float) partition by range(a);
create table p1 partition of p for values from (0) to (100);
create table p2 partition of p for values from (100) to (1000);
insert into p select i%200, i%300, i%400 from
generate_series(1, 1000)i;
analyze p;
SELECT * FROM check_estimated_rows('select b from p where c < 10 group by b');
 estimated | actual
-----------+--------
        27 |     29
(1 row)
drop table p;

I added it in the diff file.
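
For anyone trying this outside the regression suite: check_estimated_rows
is the plpgsql helper used by the extended-statistics tests
(src/test/regress/sql/stats_ext.sql). It looks roughly like this (a
sketch of that helper, not a verbatim copy):

create function check_estimated_rows(text) returns table (estimated int, actual int)
language plpgsql as
$$
declare
    ln text;
    tmp text[];
    first_row bool := true;
begin
    -- run EXPLAIN ANALYZE on the given query and pull the estimated
    -- and actual row counts out of its first output line
    for ln in
        execute format('explain analyze %s', $1)
    loop
        if first_row then
            first_row := false;
            tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
            return query select tmp[1]::int, tmp[2]::int;
        end if;
    end loop;
end;
$$;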

--
Regards,
Alena Rybakina
Postgres Professional

Attachment: test.diff (text/x-patch, 1.7 KB)
