Re: Considering fractional paths in Append node

From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: Nikita Malakhov <hukutoc(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>,Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>,David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: Considering fractional paths in Append node
Date: 2024-10-18 00:54:48
Message-ID: 87frouqlgn.fsf@163.com
Lists: pgsql-hackers

Nikita Malakhov <hukutoc(at)gmail(dot)com> writes:

> The effect is easily seen in one of standard PG tests:
> Vanilla (current master):
> explain analyze
> select t1.unique1 from tenk1 t1
> inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
> union all
> (values(1)) limit 1;
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..219.55 rows=1 width=4) (actual time=6.309..6.312 rows=1 loops=1)
> -> Append (cost=0.00..2415.09 rows=11 width=4) (actual time=6.308..6.310 rows=1 loops=1)
> -> Nested Loop (cost=0.00..2415.03 rows=10 width=4) (actual time=6.307..6.308 rows=1 loops=1)
> Join Filter: (t1.tenthous = t2.tenthous)
> Rows Removed by Join Filter: 4210
> -> Seq Scan on tenk1 t1 (cost=0.00..445.00 rows=10000 width=8) (actual time=0.004..0.057 rows=422 loops=1)
> -> Materialize (cost=0.00..470.05 rows=10 width=4) (actual time=0.000..0.014 rows=10 loops=422)
> Storage: Memory Maximum Storage: 17kB
> -> Seq Scan on tenk2 t2 (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..5.535 rows=10 loops=1)
> Filter: (thousand = 0)
> Rows Removed by Filter: 9990
> -> Result (cost=0.00..0.01 rows=1 width=4) (never executed)
> Planning Time: 0.324 ms
> Execution Time: 6.336 ms
> (14 rows)
>
> Patched, the same test:
> explain analyze
> select t1.unique1 from tenk1 t1
> inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
> union all
> (values(1)) limit 1;
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.29..126.00 rows=1 width=4) (actual time=0.105..0.106 rows=1 loops=1)
> -> Append (cost=0.29..1383.12 rows=11 width=4) (actual time=0.104..0.105 rows=1 loops=1)
> -> Nested Loop (cost=0.29..1383.05 rows=10 width=4) (actual time=0.104..0.104 rows=1 loops=1)
> -> Seq Scan on tenk2 t2 (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..0.076 rows=1 loops=1)
> Filter: (thousand = 0)
> Rows Removed by Filter: 421
> -> Index Scan using tenk1_thous_tenthous on tenk1 t1 (cost=0.29..91.30 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
> Index Cond: (tenthous = t2.tenthous)
> -> Result (cost=0.00..0.01 rows=1 width=4) (never executed)
> Planning Time: 0.334 ms
> Execution Time: 0.130 ms
> (11 rows)
>

This is a nice result. After some more thought, I feel the startup cost
calculation for sequential scans is the more important issue to address.

Bad plan: Append (cost=0.00..2415.09 ..) shows us that the startup cost is 0.
Good plan: Append (cost=0.29..1383.12 ..) shows us that the startup cost is
0.29.

The major reason for this is that we calculate the startup cost for
"SeqScan" and "Index Scan" under different guidelines. For the "Index
Scan", the startup cost is "the cost to retrieve the first tuple";
however, for "SeqScan" it is not, as we can see from the query
"SELECT * FROM tenk2 WHERE thousand = 0", whose seq scan has a startup
cost of 0 even though only a few tuples pass the filter.

In my understanding, "startup cost" should *already* mean the cost to
retrieve the first tuple, but at [1], Tom said:

"""
I think that things might work out better if we redefined the startup
cost as "estimated cost to retrieve the first tuple", rather than its
current very-squishy definition as "cost to initialize the scan".
"""

The above statement confuses me. If we take the startup cost as the cost
to retrieve the first tuple, we can do the quick hack below:

@@ -355,8 +355,8 @@ cost_seqscan(Path *path, PlannerInfo *root,
}

path->disabled_nodes = enable_seqscan ? 0 : 1;
- path->startup_cost = startup_cost;
path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
+ path->startup_cost = startup_cost + (cpu_run_cost + disk_run_cost) * (1 - path->rows / baserel->tuples);
}

We get plan:

regression=# explain
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0 union all select 1 limit 1
;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=470.12..514.00 rows=1 width=4)
-> Append (cost=470.12..952.79 rows=11 width=4)
-> Hash Join (cost=470.12..952.73 rows=10 width=4)
Hash Cond: (t1.tenthous = t2.tenthous)
-> Seq Scan on tenk1 t1 (cost=0.00..445.00 rows=10000 width=8)
-> Hash (cost=470.00..470.00 rows=10 width=4)
-> Seq Scan on tenk2 t2 (cost=469.53..470.00 rows=10 width=4)
Filter: (thousand = 0)
-> Result (cost=0.00..0.01 rows=1 width=4)
(9 rows)

set enable_hashjoin to off;

regression=# explain
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0 union all select 1 limit 1
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Limit (cost=469.81..542.66 rows=1 width=4)
-> Append (cost=469.81..1271.12 rows=11 width=4)
-> Nested Loop (cost=469.81..1271.05 rows=10 width=4)
-> Seq Scan on tenk2 t2 (cost=469.53..470.00 rows=10 width=4)
Filter: (thousand = 0)
-> Index Scan using tenk1_thous_tenthous on tenk1 t1 (cost=0.29..80.09 rows=1 width=8)
Index Cond: (tenthous = t2.tenthous)
-> Result (cost=0.00..0.01 rows=1 width=4)
(8 rows)

It looks like we still have some other work to do, but the desired plan
now has a cost closer to the estimated best plan than before.

[1]
https://www.postgresql.org/message-id/flat/3783591.1721327902%40sss.pgh.pa.us#09d6471fc59b35fa4aca939e49943c2c
--
Best Regards
Andy Fan
