Re: optimize hashjoin

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: bucoo <bucoo(at)sohu(dot)com>, 'pgsql-hackers' <pgsql-hackers(at)postgresql(dot)org>
Cc: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, 'Robert Haas' <robertmhaas(at)gmail(dot)com>
Subject: Re: optimize hashjoin
Date: 2024-08-21 16:31:49
Message-ID: cae0520f-9dc1-4109-9b54-bb583995a5d2@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 8/21/24 03:49, bucoo wrote:
> Avoid unnecessary form and deform tuple.
>

Thanks for the patch. Generally speaking, it's usually a good idea to
briefly explain what the patch does, why is that beneficial, in what
circumstances, etc. It's sometimes not quite obvious from the patch
itself (even if the patch is simple). Plus it really helps new
contributors who want to review the patch, but even for experienced
people it's a huge time saver ...

Anyway, I took a look and the basic idea is simple - when shuffling
tuples between batches in a hash join, we're currently deforming the
tuple (->slot) we just read from a batch, only to immediately form it
(slot->) again and write it to the "correct" batch.

I think the idea to skip this unnecessary deform/form step is sound, and
I don't see any particular argument against doing that. The code looks
reasonable too, I think.

A couple minor comments:

0) The patch does not apply anymore, thanks to David committing a patch
yesterday. Attached is a patch rebased on top of current master.

1) Wouldn't it be easier (and just as efficient) to use slots with
TTSOpsMinimalTuple, i.e. backed by a minimal tuple?

2) I find the naming of the new functions a bit confusing. We now have
the "original" functions working with slots, and then also functions
with "Tuple" working with tuples. Seems asymmetric.

3) The code in ExecHashJoinOuterGetTuple is hard to understand, it'd
very much benefit from some comments. I'm a bit unsure if the likely()
and unlikely() hints really help.

4) Is the hj_outerTupleBuffer buffer really needed / effective? I'd bet
just using palloc() will work just as well, thanks to the AllocSet
caching etc.

5) You might want to add the patch to the 2024-09 CF, to keep track of
it: https://commitfest.postgresql.org/49/

> In the TPCH test, HashJoin speed up to ~2x.
>
Can you provide more information about the benchmark you did? What
hardware, what scale, PostgreSQL configuration, which of the 22 queries
are improved, etc.

I ran TPC-H with 1GB and 10GB scales on two machines, and I see pretty
much no difference compared to master. However, it occurred to me the
patch only ever helps if we increase the number of batches during
execution, in which case we need to move tuples to the right batch.

Perhaps that's not happening in my testing, but it was happening in your
runs? Which just means we really need to know more about how you did the
testing.

regards

--
Tomas Vondra

Attachment Content-Type Size
optimize-hashjoin-rebased-tomas.patch text/x-patch 9.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-08-21 16:39:21 Re: [PATCH] Add get_bytes() and set_bytes() functions
Previous Message Tomas Vondra 2024-08-21 15:41:02 Re: Partial aggregates pushdown