Re: Performance Issue on Query 18 of TPC-H Benchmark

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Ba Jinsheng <bajinsheng(at)u(dot)nus(dot)edu>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Performance Issue on Query 18 of TPC-H Benchmark
Date: 2024-10-17 02:51:12
Message-ID: 98398f84-6079-47f9-958c-c7e323228cc7@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 10/17/24 01:26, Ba Jinsheng wrote:
> >I believe you are not allowing the optimizer to generate a different
> aggregation path (Group Aggregate) because it requires a sort operation.
> So I think this is not correct.
>
> Yes, this is what I did. I though it is what you were asking? I have not
> found another way to enforce HashAggregate, so I directly modified the
> code. Can you eliberate why it is incorrect?
As I see, the main problem lies in the first aggregate (with HAVING
clause) where hash aggregation seems preferable. To disable that, you
can use some trick:
SET enable_hashagg = 'on';
SET enable_sort = 'off';
SET work_mem = '1GB';

In my case the optimiser have built aggregation:
-> HashAggregate (cost=202639.35..208346.45 rows=126825 width=4)
(actual time=3540.761..4224.547 rows=9 loops=3)
Group Key: lineitem_1.l_orderkey
Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
Batches: 1 Memory Usage: 638993kB
Rows Removed by Filter: 1499991
Worker 0: Batches: 1 Memory Usage: 638993kB
Worker 1: Batches: 1 Memory Usage: 638993kB
-> Seq Scan on lineitem lineitem_1
(cost=0.00..172626.23 rows=6002623 width=9)
(actual time=0.014..675.552 rows=6001215 loops=3)

Not sure it is the best plan possible in this case. I know only about
re-optimisation feature which can provide correct number of groups
estimation to aggregates as well as cardinality and work_mem and give
the optimiser all correct estimations. But it is still an enterprise
feature :(.

> Can I understand disabling parallelism is a good setup for finding
> performance issues?
I usually use such such a switch to identify problems. Parallel workers
use HashJoin more frequently and it sometimes cause non-linear behaviour
of execution time, compared to sequental plan. At the same time, they
donn't support parameterised paths and it may end up in interesting
performance jumps ...

BTW, I also wonder why do you report to pgsql-bugs in presence of better
fitted pgsql-performance thread?

--
regards, Andrei Lepikhov

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Langote 2024-10-17 03:52:09 Re: BUG #18657: Using JSON_OBJECTAGG with volatile function leads to segfault
Previous Message David Rowley 2024-10-17 02:14:35 Re: Performance Issue on Query 18 of TPC-H Benchmark