From: "bucoo" <bucoo(at)sohu(dot)com>
To: "'Tomas Vondra'" <tomas(at)vondra(dot)me>, "'pgsql-hackers'" <pgsql-hackers(at)postgresql(dot)org>
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Robert Haas'" <robertmhaas(at)gmail(dot)com>
Subject: Re: optimize hashjoin
Date: 2024-08-23 11:02:22
Message-ID: 000001daf54b$f134e9a0$d39ebce0$@sohu.com
Lists: pgsql-hackers
> * mtup is held in hjstate->hj_outerTupleBuffer, so we can pass
> * shouldFree as false when calling ExecForceStoreMinimalTuple().
> *
> * When the slot is TTSOpsMinimalTuple we can avoid reallocating memory for
> * a new MinimalTuple (reuse the StringInfo when calling ExecHashJoinGetSavedTuple).
>
> But my point was that I don't think the palloc/repalloc should be very expensive, once the AllocSet warms up a bit.
Avoiding the palloc/repalloc is just a side effect of avoiding reforming the tuple.
> * More importantly, in non-TTSOpsMinimalTuple scenarios, it can avoid
> * reforming (materializing) the tuple (see ExecForceStoreMinimalTuple).
>
> Yeah, but doesn't that conflate two things - materialization and freeing the memory? Only because materialization is expensive, is that a good reason to abandon the memory management too?
Currently, I haven't thought of a better way to avoid the reform step.
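To illustrate the idea, here is a rough sketch (not the exact patch code; the patch apparently reuses a StringInfo for the buffer, while this sketch uses a plain allocation, and hj_outerTupleBufferSize is a made-up field name used only for illustration) of how the saved-tuple read path can reuse a per-join buffer and pass shouldFree = false to ExecForceStoreMinimalTuple():

/*
 * Sketch only, assuming the nodeHashjoin.c context: reuse
 * hjstate->hj_outerTupleBuffer instead of palloc'ing a fresh MinimalTuple
 * for every tuple read back from a batch file.
 */
static TupleTableSlot *
ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
                          BufFile *file,
                          uint32 *hashvalue,
                          TupleTableSlot *tupleSlot)
{
    uint32       header[2];
    MinimalTuple tuple;

    CHECK_FOR_INTERRUPTS();

    /* header[0] is the hash value, header[1] the MinimalTuple length */
    if (!BufFileReadMaybeEOF(file, header, sizeof(header), true))
    {
        ExecClearTuple(tupleSlot);
        return NULL;            /* end of file */
    }
    *hashvalue = header[0];

    /*
     * Grow the reusable buffer only when the tuple does not fit.
     * hj_outerTupleBufferSize is a hypothetical field tracking the
     * allocated size.
     */
    if (hjstate->hj_outerTupleBuffer == NULL ||
        header[1] > hjstate->hj_outerTupleBufferSize)
    {
        if (hjstate->hj_outerTupleBuffer != NULL)
            pfree(hjstate->hj_outerTupleBuffer);
        /* keep the buffer in a long-lived context, e.g. the query context */
        hjstate->hj_outerTupleBuffer =
            MemoryContextAlloc(hjstate->js.ps.state->es_query_cxt, header[1]);
        hjstate->hj_outerTupleBufferSize = header[1];
    }

    tuple = (MinimalTuple) hjstate->hj_outerTupleBuffer;
    tuple->t_len = header[1];
    BufFileReadExact(file,
                     (char *) tuple + sizeof(uint32),
                     header[1] - sizeof(uint32));

    /*
     * shouldFree = false: the slot must not pfree the shared buffer, and a
     * TTSOpsMinimalTuple slot can point at it directly instead of copying
     * ("reforming") the tuple.
     */
    ExecForceStoreMinimalTuple(tuple, tupleSlot, false);
    return tupleSlot;
}

The trade-off is that the slot now borrows memory it does not own, so the buffer must stay valid for as long as the slot's contents are used; the stock code sidesteps this by handing ownership of a freshly palloc'd tuple to the slot.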
> >
> >> Can you provide more information about the benchmark you did? What
> >> hardware, what scale, PostgreSQL configuration, which of the 22
> >> queries are improved, etc.
> >>
> >> I ran TPC-H with 1GB and 10GB scales on two machines, and I see
> >> pretty much no difference compared to master. However, it occurred to
> >> me the patch only ever helps if we increase the number of batches
> >> during execution, in which case we need to move tuples to the right batch.
> >
> > Only the parallel HashJoin sped up, to ~2x (all data cached in memory),
> > not the full query, including non-parallel HashJoin.
> >
> > Non-parallel HashJoin only speeds up when the number of batches is larger than one,
> > because this patch only optimizes reading batch tuples back into memory.
> >
>
> I'm sorry, but this does not answer *any* of the questions I asked.
>
> Please provide enough info to reproduce the benefit - benchmark scale, which query, which
> parameters, etc. Show explain / explain analyze of the query without / with the patch, stuff
> like that.
>
> I ran a number of TPC-H benchmarks with the patch and I never saw a benefit of this scale.
After further testing, it turns out that the parallel hashjoin did not improve performance. I might have compared against a debug build at the time. I apologize for that.
However, the non-parallel hashjoin did show about a 10% performance improvement.
Here is the testing information:
CPU: 13th Gen Intel(R) Core(TM) i7-13700
Memory: 32GB
SSD: UMIS REPEYJ512MKN1QWQ
Windows version: win11 23H2 22631.4037
WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
OS version: Rocky Linux 9.4
TPC-H: SF=8
SQL:
set max_parallel_workers_per_gather = 0;
set enable_mergejoin = off;
explain (verbose,analyze)
select count(*)
from lineitem, orders
where lineitem.l_orderkey = orders.o_orderkey;
Before patch:
Aggregate (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=10591.679..10591.681 rows=1 loops=1)
Output: count(*)
-> Hash Join (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1075.213..9503.727 rows=47989007 loops=1)
Inner Unique: true
Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
-> Index Only Scan using lineitem_pkey on public.lineitem (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.023..1974.365 rows=47989007 loops=1)
Output: lineitem.l_orderkey
Heap Fetches: 0
-> Hash (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1074.155..1074.156 rows=12000000 loops=1)
Output: orders.o_orderkey
Buckets: 262144 Batches: 128 Memory Usage: 5335kB
-> Index Only Scan using orders_pkey on public.orders (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.014..464.346 rows=12000000 loops=1)
Output: orders.o_orderkey
Heap Fetches: 0
Planning Time: 0.141 ms
Execution Time: 10591.730 ms
(16 rows)
After patch:
Aggregate (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=9826.105..9826.106 rows=1 loops=1)
Output: count(*)
-> Hash Join (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1087.588..8726.441 rows=47989007 loops=1)
Inner Unique: true
Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
-> Index Only Scan using lineitem_pkey on public.lineitem (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.015..1989.389 rows=47989007 loops=1)
Output: lineitem.l_orderkey
Heap Fetches: 0
-> Hash (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1086.357..1086.358 rows=12000000 loops=1)
Output: orders.o_orderkey
Buckets: 262144 Batches: 128 Memory Usage: 5335kB
-> Index Only Scan using orders_pkey on public.orders (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.011..470.225 rows=12000000 loops=1)
Output: orders.o_orderkey
Heap Fetches: 0
Planning Time: 0.065 ms
Execution Time: 9826.135 ms