Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>, <jonah(dot)harris(at)gmail(dot)com>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-11-07 00:31:15
Message-ID: 6EEA43D22289484890D119821101B1DF2C16D7@exchange20.mercury.ad.ubc.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> -----Original Message-----
> > Minor question on this patch. AFAICS there is another patch that
seems
> > to be aiming at exactly the same use case. Jonah's Bloom filter
patch.
> >
> > Shouldn't we have a dust off to see which one is best? Or at least a
> > discussion to test whether they overlap? Perhaps you already did
that
> > and I missed it because I'm not very tuned in on this thread.
> >
> > --
> > Simon Riggs www.2ndQuadrant.com
> > PostgreSQL Training, Services and Support
>
> We haven't had that discussion AFAIK, and definitely should. First
> glance suggests they could coexist peacefully, with proper coaxing. If
> I understand things properly, Jonah's patch filters tuples early in
> the join process, and this patch tries to ensure that hash join
> batches are kept in RAM when they're most likely to be used. So
> they're orthogonal in purpose, and the patches actually apply *almost*
> cleanly together. Jonah, any comments? If I continue to have some time
> to devote, and get through all I think I can do to review this patch,
> I'll gladly look at Jonah's too, FWIW.
>
> - Josh

The skew patch and bloom filter patch are orthogonal and can both be
applied. The bloom filter patch is a great idea, and it is used in many
other database systems. You can use the TPC-H data set to demonstrate
that the bloom filter patch will significantly improve performance of
multi-batch joins (with or without data skew).

Any query that filters a build table before joining on the probe table
will show improvements with a bloom filter. For example,

select * from customer, orders where customer.c_nationkey = 10 and
customer.c_custkey = orders.o_custkey

The bloom filter on customer would allow us to avoid probing with orders
tuples that cannot possibly find a match due to the selection criteria.
This is especially beneficial for multi-batch joins where an orders
tuple must be written to disk if its corresponding customer batch is not
the in-memory batch.

I have no experience reviewing patches, but I would be happy to help
contribute/review the bloom filter patch as best I can.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon(dot)lawrence(at)ubc(dot)ca

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message ITAGAKI Takahiro 2008-11-07 00:41:43 Re: No write stats in pg_statio system views
Previous Message Robert Haas 2008-11-07 00:25:38 Re: [WIP] In-place upgrade