Re: Hash Joins vs. Bloom Filters / take 2

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	"Finnerty, Jim" <jfinnert(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hash Joins vs. Bloom Filters / take 2
Date:	2018-11-02 14:34:04
Message-ID:	CA+Tgmoa4M4tOv93EM10CcMJ0h0T1mp9fZm9bpnhsh7qOsY_q+Q@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Nov 1, 2018 at 5:07 PM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Would you compute the hash for the outer tuples in the scan, and then
> again in the Hash Join when probing, or would you want to (somehow)
> attach the hash to emitted tuples for later reuse by the higher node?

I'm interested in what Jim has to say, but to me it seems like we
should try to find a way to add a tlist entry for the hash value to
avoid recomputing it. That's likely to require some tricky planner
surgery, but it's probably doable.

What really seems finicky to me about this whole project is the
costing. In the best case it's a huge win; in the worst case it's a
significant loss; and whether it's a gain or a loss is not easy to
figure out from the information that we have available. We generally
do not have an accurate count of the number of distinct values we're
likely to see (which is important).
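
A rough way to see why the distinct-value count matters is a toy cost model like the one below. This is a sketch, not a proposal for the real cost functions: the standard Bloom filter false-positive approximation is used, and the parameters probe_cost and saved_cost (per-tuple cost of checking the filter, and per-tuple work avoided when a non-matching tuple is rejected early) are assumptions made up for illustration.

```python
import math


def bloom_fpr(ndistinct, nbits, nhashes):
    """Textbook approximation of the Bloom filter false-positive rate."""
    return (1.0 - math.exp(-nhashes * ndistinct / nbits)) ** nhashes


def filter_net_benefit(outer_rows, match_fraction, ndistinct,
                       nbits, nhashes, probe_cost, saved_cost):
    """Net gain of using the filter, in arbitrary cost units.

    Every outer row pays probe_cost; only non-matching rows that the
    filter actually rejects recover saved_cost. A negative result means
    the filter is a net loss.
    """
    fpr = bloom_fpr(ndistinct, nbits, nhashes)
    rejected = outer_rows * (1.0 - match_fraction) * (1.0 - fpr)
    return rejected * saved_cost - outer_rows * probe_cost
```

With a fixed filter size, underestimating ndistinct by a couple of orders of magnitude is enough in this model to turn a clear win into a pure loss, because the filter saturates and rejects almost nothing while every outer row still pays for the probe.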

Worse, when you start to consider pushdown, you realize that the cost
of the scan depends on the bloom filter we push down to it. So
consider something like A IJ B IJ C. It seems like it could be the
case that once we decide to do the A-B join as a hash join with a
bloom filter, it makes sense to also do the join to C as a hash join
and push down the bloom filter, because we'll be able to combine the
two filters and the extra probes will be basically free. But if we
weren't already doing the A-B join with a bloom filter, then maybe the
filter wouldn't make sense for C either.
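
The interdependence can be sketched with another toy model. The assumption here, which is mine and only for illustration, is that having any filter step in the scan carries a fixed per-row overhead, and each additional combined filter adds only a small per-row probe cost. The names scan_filter_cost and c_filter_gain are hypothetical.

```python
def scan_filter_cost(nfilters, rows, fixed_overhead, probe_cost):
    """Cost of applying nfilters combined filters in one scan.

    Assumption: the fixed overhead is paid once per row if any filter
    is present; each filter adds probe_cost per row on top of that.
    """
    if nfilters == 0:
        return 0.0
    return rows * (fixed_overhead + nfilters * probe_cost)


def c_filter_gain(b_filter_used, rows, fixed_overhead, probe_cost,
                  c_savings):
    """Net gain from also pushing down the filter for the join to C.

    c_savings is the (assumed) total work avoided downstream thanks to
    the C filter. The extra scan cost depends on whether the A-B filter
    is already being applied.
    """
    existing = 1 if b_filter_used else 0
    extra = (scan_filter_cost(existing + 1, rows, fixed_overhead, probe_cost)
             - scan_filter_cost(existing, rows, fixed_overhead, probe_cost))
    return c_savings - extra
```

In this model the same C filter can come out positive when the A-B filter is already in place and negative when it is not, so the two decisions cannot be costed independently.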

Maybe I'm worrying over nothing here, or the wrong things, but costing
this well enough to avoid regressions really looks hard.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Hash Joins vs. Bloom Filters / take 2 at 2018-11-01 21:06:47 from Thomas Munro

Responses

Re: Hash Joins vs. Bloom Filters / take 2 at 2018-11-05 16:04:02 from Jim Finnerty
