Re: Performance Issue on Query 18 of TPC-H Benchmark

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Ba Jinsheng <bajinsheng(at)u(dot)nus(dot)edu>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Performance Issue on Query 18 of TPC-H Benchmark
Date: 2024-10-15 23:21:09
Message-ID: 38109579-3508-4a79-af55-8eb05ed1d7f0@postgrespro.ru
Lists: pgsql-bugs

Hi!

On 15.10.2024 21:28, Ba Jinsheng wrote:
> For this query 18 of TPC-H benchmark:
> select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
> sum(l_quantity) from CUSTOMER, ORDERS, LINEITEM where o_orderkey in (
> select l_orderkey from LINEITEM group by l_orderkey having
> sum(l_quantity) > 314 ) and c_custkey = o_custkey and o_orderkey =
> l_orderkey group by c_name, c_custkey, o_orderkey, o_orderdate,
> o_totalprice order by o_totalprice desc, o_orderdate limit 100;
>
>
> Based on 1GB data
> (https://drive.google.com/file/d/1ZBLHanIRwxbaMQIhRUSPv4I7y8g_0AWi/view?usp=drive_link
> <https://drive.google.com/file/d/1ZBLHanIRwxbaMQIhRUSPv4I7y8g_0AWi/view?usp=drive_link>),
> its query plan (EXPLAIN ANALYZE) is as follows:
>                                        QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=666783.65..*666783.90* rows=100 width=71) (actual
> time=2511.074..2517.146 rows=9 loops=1)
>    ->  Sort  (cost=666783.65..668052.46 rows=507522 width=71) (actual
> time=2511.073..2517.144 rows=9 loops=1)
>          Sort Key: orders.o_totalprice DESC, orders.o_orderdate
>          Sort Method: quicksort  Memory: 25kB
>          ->  Finalize GroupAggregate  (cost=583237.80..647386.53
> rows=507522 width=71) (actual time=2511.057..2517.137 rows=9 loops=1)
>                Group Key: customer.c_custkey, orders.o_orderkey
>                ->  Gather Merge  (cost=583237.80..636813.14
> rows=422936 width=71) (actual time=2511.053..2517.128 rows=9 loops=1)
>                      Workers Planned: 2
>                      Workers Launched: 2
>                      ->  Partial GroupAggregate
>  (cost=582237.78..586995.81 rows=211468 width=71) (actual
> time=2507.789..2507.795 rows=3 loops=3)
>                            Group Key: customer.c_custkey,
> orders.o_orderkey
>                            ->  *Sort  (cost=582237.78..582766.45
> rows=211468 width=44) (actual time=2507.783..2507.786 rows=21 loops=3)*
>                                  Sort Key: customer.c_custkey,
> orders.o_orderkey
>                                  Sort Method: quicksort  Memory: 26kB
>                                  Worker 0:  Sort Method: quicksort
>  Memory: 26kB
>                                  Worker 1:  Sort Method: quicksort
>  Memory: 25kB
>                                  ->  Nested Loop
>  (cost=458569.18..557026.85 rows=211468 width=44) (actual
> time=2464.821..2507.757 rows=21 loops=3)
>                                        ->  Parallel Hash Join
>  (cost=458568.75..492734.09 rows=52844 width=43) (actual
> time=2464.793..2507.699 rows=3 loops=3)
>                                              Hash Cond:
> (orders.o_custkey = customer.c_custkey)
>                                              ->  Hash Join
>  (cost=453562.50..487589.13 rows=52844 width=24) (actual
> time=2433.507..2476.356 rows=3 loops=3)
>                                                    Hash Cond:
> (orders.o_orderkey = lineitem_1.l_orderkey)
>                                                    ->  Parallel Seq
> Scan on orders  (cost=0.00..32386.00 rows=625000 width=20) (actual
> time=0.024..33.467 rows=500000 loops=3)
>                                                    ->  Hash
>  (cost=451977.19..451977.19 rows=126825 width=4) (actual
> time=2412.836..2412.836 rows=9 loops=3)
>  Buckets: 131072  Batches: 1  Memory Usage: 1025kB
>                                                          ->
>  GroupAggregate  (cost=0.43..451977.19 rows=126825 width=4) (actual
> time=708.272..2412.797 rows=9 loops=3)
>  Group Key: lineitem_1.l_orderkey
>  Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
>  Rows Removed by Filter: 1499991
> *->  Index Scan using lineitem_pkey on lineitem lineitem_1
>  (cost=0.43..416256.96 rows=6002623 width=9) (actual
> time=0.052..1331.820 rows=6001215 loops=3)*
>                                              ->  Parallel Hash
>  (cost=4225.00..4225.00 rows=62500 width=23) (actual
> time=30.683..30.683 rows=50000 loops=3)
>                                                    Buckets: 262144
>  Batches: 1  Memory Usage: 10304kB
>                                                    ->  Parallel Seq
> Scan on customer  (cost=0.00..4225.00 rows=62500 width=23) (actual
> time=0.019..11.368 rows=50000 loops=3)
>                                        ->  Index Scan using
> lineitem_pkey on lineitem  (cost=0.43..1.06 rows=16 width=9) (actual
> time=0.016..0.017 rows=7 loops=9)
>                                              Index Cond: (l_orderkey =
> orders.o_orderkey)
>  Planning Time: 0.833 ms
>  Execution Time: 2517.189 ms
> (36 rows)
>
>
> While I found that if we disable the path sorting here:
> diff --git a/src/backend/optimizer/plan/planner.c
> b/src/backend/optimizer/plan/planner.c
> index 0c7273b9cc..91320f473a 100644
> --- a/src/backend/optimizer/plan/planner.c
> +++ b/src/backend/optimizer/plan/planner.c
> @@ -6928,7 +6928,7 @@ make_ordered_path(PlannerInfo *root, RelOptInfo
> *rel, Path *path,
>                         path->pathkeys,
>                         &presorted_keys);
> -       if (!is_sorted)
> +       if (false)
>         {
>                 /*
>                  * Try at least sorting the cheapest path and also try
> incrementally
>
> Both the performance and estimated cost are reduced around 40%:
>                            QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=390889.59..*390889.84* rows=100 width=71) (actual
> time=1652.572..1655.298 rows=9 loops=1)
>    ->  Sort  (cost=390889.59..392158.39 rows=507522 width=71) (actual
> time=1652.571..1655.296 rows=9 loops=1)
>          Sort Key: orders.o_totalprice DESC, orders.o_orderdate
>          Sort Method: quicksort  Memory: 25kB
>          ->  Finalize GroupAggregate  (cost=215938.45..371492.46
> rows=507522 width=71) (actual time=1651.864..1655.256 rows=9 loops=1)
>                Group Key: customer.c_custkey, orders.o_orderkey
>                ->  Gather  (cost=215938.45..360919.08 rows=422936
> width=71) (actual time=1651.670..1655.245 rows=9 loops=1)
>                      Workers Planned: 2
>                      Workers Launched: 2
>                      ->  Partial GroupAggregate
>  (cost=214938.45..317625.48 rows=211468 width=71) (actual
> time=1616.827..1648.580 rows=3 loops=3)
>                            Group Key: customer.c_custkey,
> orders.o_orderkey
>                            ->  Nested Loop  (cost=214938.45..313396.12
> rows=211468 width=44) (actual time=1609.796..1648.571 rows=21 loops=3)
>                                  ->  Parallel Hash Join
>  (cost=214938.02..249103.36 rows=52844 width=43) (actual
> time=1609.777..1648.532 rows=3 loops=3)
>                                        Hash Cond: (orders.o_custkey =
> customer.c_custkey)
>                                        ->  Hash Join
>  (cost=209931.77..243958.39 rows=52844 width=24) (actual
> time=1573.950..1612.634 rows=3 loops=3)
>                                              Hash Cond:
> (orders.o_orderkey = lineitem_1.l_orderkey)
>                                              ->  Parallel Seq Scan on
> orders  (cost=0.00..32386.00 rows=625000 width=20) (actual
> time=0.016..32.211 rows=500000 loops=3)
>                                              ->  Hash
>  (cost=208346.45..208346.45 rows=126825 width=4) (actual
> time=1549.849..1549.849 rows=9 loops=3)
>                                                    Buckets: 131072
>  Batches: 1  Memory Usage: 1025kB
>                                                    ->  GroupAggregate
>  (cost=0.00..208346.45 rows=126825 width=4) (actual
> time=334.397..1549.837 rows=9 loops=3)
>                                                          Group Key:
> lineitem_1.l_orderkey
>  Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
>                                                          Rows Removed
> by Filter: 1500388
> * ->  Seq Scan on lineitem lineitem_1  (cost=0.00..172626.23
> rows=6002623 width=9) (actual time=0.051..438.226 rows=6001215 loops=3)*
>                                        ->  Parallel Hash
>  (cost=4225.00..4225.00 rows=62500 width=23) (actual
> time=34.762..34.763 rows=50000 loops=3)
>                                              Buckets: 262144  Batches:
> 1  Memory Usage: 10304kB
>                                              ->  Parallel Seq Scan on
> customer  (cost=0.00..4225.00 rows=62500 width=23) (actual
> time=0.011..10.406 rows=50000 loops=3)
>                                  ->  Index Scan using lineitem_pkey on
> lineitem  (cost=0.43..1.06 rows=16 width=9) (actual time=0.009..0.011
> rows=7 loops=9)
>                                        Index Cond: (l_orderkey =
> orders.o_orderkey)
>  Planning Time: 1.644 ms
>  Execution Time: 1655.490 ms
> (31 rows)
>
> The major difference between the two query plans is that the first one
> has an additional *Sort* node. I believe the second query plan is more
> efficient. Similar to my last report, perhaps we can optimize the code
> to enable it.
>
> I also tried 10 GB of data, where the execution time is reduced from
> *30s* to *16s*, although the estimated cost is increased. I can provide
> more info if you need it.
>
>
> Best regards,
>
> Jinsheng Ba
>
> Notice: This email is generated from the account of an NUS alumnus.
> Contents, views, and opinions therein are solely those of the sender.

Can you tell me more about the user and the name of your dump database?
I can't log in to the database:

$ my/inst/bin/psql -U postgres -d postgres
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL:  role "postgres" does not exist
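As a side note, when initdb is run without -U, the bootstrap superuser is
named after the operating-system user, so a "postgres" role may simply not
exist in the cluster. A minimal sketch of a workaround (the SUPERUSER
attribute is an assumption about how the dump was taken):

```sql
-- connect as the bootstrap superuser first, e.g.:
--   my/inst/bin/psql -U "$USER" -d postgres
-- then create the missing role:
CREATE ROLE postgres SUPERUSER LOGIN;
```

But I would still need the name of the database the dump should be
restored into.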

--
Regards,
Alena Rybakina
Postgres Professional
