Re: Performance of Query 4 on TPC-DS Benchmark

From: Ba Jinsheng <bajinsheng(at)u(dot)nus(dot)edu>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>, "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: Performance of Query 4 on TPC-DS Benchmark
Date: 2024-11-11 10:49:56
Message-ID: SEZPR06MB649449C88C484B0BDE17A9C28A582@SEZPR06MB6494.apcprd06.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


>It is all the time a challenge for PostgreSQL to estimate such a filter
>because of absent information on joint column distribution.
>Can you research this way by building extended statistics on these
>clauses? It could move the plan to the more optimal direction.

Thanks a lot for your effort to analyze this issue, and we really appreciate your suggestions! Currently, we focus on exposing these issues that affect performance. In the future, we may consider to look into such a direction as you suggested.

> Have you tried any tools to improve the cardinality yet, like aqo [0]?
Yes, but it takes nearly 1 hour to run this query at a time, so I only run "EXPLAIN ANALYZE" once, and the performance seems slightly improved.

QUERY PLAN

Limit (cost=293880.50..293880.50 rows=1 width=132) (actual time=2527921.078..2527921.233 rows=8 loops=1)
CTE year_total
-> Gather (cost=115049.92..233367.07 rows=384208 width=216) (actual time=1116.139..4005.105 rows=384208 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Append (cost=114049.92..193946.27 rows=160087 width=216) (actual time=2430.791..2510.131 rows=128069 loops=3)
-> HashAggregate (cost=190763.57..193145.83 rows=190581 width=216) (actual time=3977.521..4070.200 rows=190581 loops=1)
Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year
Worker 1: Batches: 1 Memory Usage: 120857kB
-> Hash Join (cost=8151.60..103486.35 rows=2685453 width=174) (actual time=64.667..1605.601 rows=2685453 loops=1)
Hash Cond: (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
-> Hash Join (cost=5103.00..93216.88 rows=2750652 width=174) (actual time=48.111..1121.801 rows=2750652 loops=1)
Hash Cond: (store_sales.ss_customer_sk = customer.c_customer_sk)
-> Seq Scan on store_sales (cost=0.00..80552.52 rows=2880404 width=30) (actual time=0.068..230.529 rows=2880404 loops=1)
-> Hash (cost=3853.00..3853.00 rows=100000 width=152) (actual time=47.735..47.735 rows=100000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 17161kB
-> Seq Scan on customer (cost=0.00..3853.00 rows=100000 width=152) (actual time=0.012..25.023 rows=100000 loops=1)
-> Hash (cost=2135.49..2135.49 rows=73049 width=8) (actual time=16.242..16.242 rows=73049 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 3878kB
-> Seq Scan on date_dim (cost=0.00..2135.49 rows=73049 width=8) (actual time=0.074..8.744 rows=73049 loops=1)
-> HashAggregate (cost=114049.92..115762.15 rows=136978 width=216) (actual time=2199.723..2268.851 rows=136978 loops=1)
Group Key: customer_1.c_customer_id, customer_1.c_first_name, customer_1.c_last_name, customer_1.c_preferred_cust_flag, customer_1.c_birth_country, customer_1.c_login, customer_1.c_email_address, date_dim_1.d_year
Worker 0: Batches: 1 Memory Usage: 88089kB
-> Hash Join (cost=8151.60..67544.41 rows=1430939 width=177) (actual time=81.920..911.231 rows=1430939 loops=1)
Hash Cond: (catalog_sales.cs_sold_date_sk = date_dim_1.d_date_sk)
-> Hash Join (cost=5103.00..60729.97 rows=1434519 width=177) (actual time=53.469..638.140 rows=1434519 loops=1)
Hash Cond: (catalog_sales.cs_bill_customer_sk = customer_1.c_customer_sk)
-> Seq Scan on catalog_sales (cost=0.00..51842.75 rows=1441548 width=33) (actual time=0.066..134.023 rows=1441548 loops=1)
-> Hash (cost=3853.00..3853.00 rows=100000 width=152) (actual time=52.937..52.937 rows=100000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 17161kB
-> Seq Scan on customer customer_1 (cost=0.00..3853.00 rows=100000 width=152) (actual time=0.019..27.549 rows=100000 loops=1)
-> Hash (cost=2135.49..2135.49 rows=73049 width=8) (actual time=27.968..27.968 rows=73049 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 3878kB
-> Seq Scan on date_dim date_dim_1 (cost=0.00..2135.49 rows=73049 width=8) (actual time=0.099..14.115 rows=73049 loops=1)
-> HashAggregate (cost=61268.33..61976.44 rows=56649 width=216) (actual time=1115.125..1142.838 rows=56649 loops=1)
Group Key: customer_2.c_customer_id, customer_2.c_first_name, customer_2.c_last_name, customer_2.c_preferred_cust_flag, customer_2.c_birth_country, customer_2.c_login, customer_2.c_email_address, date_dim_2.d_year
Batches: 1 Memory Usage: 35865kB
-> Hash Join (cost=8151.60..37896.96 rows=719119 width=177) (actual time=85.606..491.698 rows=719119 loops=1)
Hash Cond: (web_sales.ws_sold_date_sk = date_dim_2.d_date_sk)
-> Hash Join (cost=5103.00..32960.30 rows=719217 width=177) (actual time=59.536..342.685 rows=719217 loops=1)
Hash Cond: (web_sales.ws_bill_customer_sk = customer_2.c_customer_sk)
-> Seq Scan on web_sales (cost=0.00..25968.84 rows=719384 width=33) (actual time=0.032..67.592 rows=719384 loops=1)
-> Hash (cost=3853.00..3853.00 rows=100000 width=152) (actual time=59.430..59.430 rows=100000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 17161kB
-> Seq Scan on customer customer_2 (cost=0.00..3853.00 rows=100000 width=152) (actual time=0.006..33.826 rows=100000 loops=1)
-> Hash (cost=2135.49..2135.49 rows=73049 width=8) (actual time=25.997..25.998 rows=73049 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 3878kB
-> Seq Scan on date_dim date_dim_2 (cost=0.00..2135.49 rows=73049 width=8) (actual time=0.016..13.499 rows=73049 loops=1)
-> Sort (cost=60513.43..60513.44 rows=1 width=132) (actual time=2527921.077..2527921.080 rows=8 loops=1)
Sort Key: t_s_secyear.customer_id, t_s_secyear.customer_first_name, t_s_secyear.customer_last_name, t_s_secyear.customer_email_address
Sort Method: quicksort Memory: 26kB
-> Nested Loop (cost=0.00..60513.42 rows=1 width=132) (actual time=388081.669..2527921.053 rows=8 loops=1)
Join Filter: ((t_s_secyear.customer_id = t_c_firstyear.customer_id) AND (CASE WHEN (t_c_firstyear.year_total > '0'::numeric) THEN (t_c_secyear.year_total / t_c_firstyear.year_total) ELSE NULL::numeric END > CASE WHEN (t_s_firstyear.year_total > '0'::numeric) THEN (t_s_secyear.year_total / t_s_firstyear.year_total) ELSE NULL::numeric END) AND (CASE WHEN (t_c_firstyear.year_total > '0'::numeric) THEN (t_c_secyear.year_total / t_c_firstyear.year_total) ELSE NULL::numeric END > CASE WHEN (t_w_firstyear.year_total > '0'::numeric) THEN (t_w_secyear.year_total / t_w_firstyear.year_total) ELSE NULL::numeric END))
Rows Removed by Join Filter: 1289378
-> Nested Loop (cost=0.00..49947.59 rows=1 width=372) (actual time=106080.918..2524931.374 rows=49 loops=1)
Join Filter: (t_s_secyear.customer_id = t_w_secyear.customer_id)
Rows Removed by Join Filter: 4962083
-> Nested Loop (cost=0.00..40342.26 rows=1 width=320) (actual time=10316.060..2506476.401 rows=441 loops=1)
Join Filter: (t_s_secyear.customer_id = t_c_secyear.customer_id)
Rows Removed by Join Filter: 44298069
-> Nested Loop (cost=0.00..30736.94 rows=1 width=268) (actual time=7429.522..2435241.747 rows=1630 loops=1)
Join Filter: (t_s_firstyear.customer_id = t_s_secyear.customer_id)
Rows Removed by Join Filter: 165296120
-> Nested Loop (cost=0.00..21131.61 rows=1 width=104) (actual time=3984.981..2240540.776 rows=4330 loops=1)
Join Filter: (t_s_firstyear.customer_id = t_w_firstyear.customer_id)
Rows Removed by Join Filter: 429435722
-> CTE Scan on year_total t_s_firstyear (cost=0.00..10565.72 rows=3 width=52) (actual time=3984.972..4058.789 rows=37923 loops=1)
Filter: ((year_total > '0'::numeric) AND (sale_type = 's'::text) AND (dyear = 2001))
Rows Removed by Filter: 346285
-> CTE Scan on year_total t_w_firstyear (cost=0.00..10565.72 rows=3 width=52) (actual time=0.001..58.349 rows=11324 loops=37923)
Filter: ((year_total > '0'::numeric) AND (sale_type = 'w'::text) AND (dyear = 2001))
Rows Removed by Filter: 372884
-> CTE Scan on year_total t_s_secyear (cost=0.00..9605.20 rows=10 width=164) (actual time=20.421..42.865 rows=38175 loops=4330)
Filter: ((sale_type = 's'::text) AND (dyear = 2002))
Rows Removed by Filter: 346033
-> CTE Scan on year_total t_c_secyear (cost=0.00..9605.20 rows=10 width=52) (actual time=5.974..42.155 rows=27177 loops=1630)
Filter: ((sale_type = 'c'::text) AND (dyear = 2002))
Rows Removed by Filter: 357031
-> CTE Scan on year_total t_w_secyear (cost=0.00..9605.20 rows=10 width=52) (actual time=0.002..41.219 rows=11252 loops=441)
Filter: ((sale_type = 'w'::text) AND (dyear = 2002))
Rows Removed by Filter: 372956
-> CTE Scan on year_total t_c_firstyear (cost=0.00..10565.72 rows=3 width=52) (actual time=8.525..59.572 rows=26314 loops=49)
Filter: ((year_total > '0'::numeric) AND (sale_type = 'c'::text) AND (dyear = 2001))
Rows Removed by Filter: 357894
Planning Time: 17.924 ms
Execution Time: 2527936.040 ms
(86 rows)
Notice: This email is generated from the account of an NUS alumnus. Contents, views, and opinions therein are solely those of the sender.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Ba Jinsheng 2024-11-11 10:51:16 Re: Performance of Query 4 on TPC-DS Benchmark
Previous Message Andrei Lepikhov 2024-11-11 09:41:01 Re: Performance of Query 4 on TPC-DS Benchmark