
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From:	"Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To:	"Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>
Subject:	Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date:	2008-11-02 23:48:36
Message-ID:	6EEA43D22289484890D119821101B1DF2C16BC@exchange20.mercury.ad.ubc.ca
Lists:	pgsql-hackers

Joshua,

Thank you for offering to review the patch.

The easiest way to test would be to generate your own TPC-H data and
load it into a database for testing. I have posted the TPC-H generator
at:

http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip

The generator can produce skewed data sets. It was produced by
Microsoft Research.

After unzipping, on a Windows machine, you can just run the command:

dbgen -s 1 -z 1

This will produce a TPC-H database of scale 1 GB with a Zipfian skew of
z=1. More information on the generator is in the document README-S.DOC.
Source is provided for the generator, so you should be able to run it on
other operating systems as well.
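
To give a feel for what a Zipfian skew of z=1 means for join keys, here is a minimal Python sketch. It is only an illustration and assumes the textbook definition, in which the value of rank r is drawn with probability proportional to 1/r^z. The generator's actual procedure is described in README-S.DOC and is not reproduced here; the function names below are hypothetical.

```python
import random
from collections import Counter


def zipf_weights(n_values, z):
    """Unnormalized Zipfian weights: the value of rank r gets 1 / r**z."""
    return [1.0 / (rank ** z) for rank in range(1, n_values + 1)]


def sample_keys(n_rows, n_values, z, seed=0):
    """Draw n_rows keys from 1..n_values with Zipfian skew z (z=0 is uniform)."""
    rng = random.Random(seed)
    weights = zipf_weights(n_values, z)
    return rng.choices(range(1, n_values + 1), weights=weights, k=n_rows)


def top_share(keys, k):
    """Fraction of all rows that carry one of the k most frequent keys."""
    counts = Counter(keys)
    return sum(c for _, c in counts.most_common(k)) / len(keys)


if __name__ == "__main__":
    for z in (0, 1):
        keys = sample_keys(n_rows=100_000, n_values=1_000, z=z)
        print(f"z={z}: top 10 keys cover {top_share(keys, 10):.1%} of rows")
```

With z=0 every key is equally likely, so the ten most frequent keys cover only a small fraction of the rows. With z=1 a handful of keys account for a large share, which is the situation the patch targets.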

The schema DDL is at:

http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt

Note that the load time for 1 GB of data is 1-2 hours and for 10 GB of
data is about 24 hours. I recommend you do not add the foreign keys
until after the data is loaded.

The other alternative is to do a pg_dump of our data sets. However, the
download size would be quite large, and it will take a couple of days
for us to get you the data in that form.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon(dot)lawrence(at)ubc(dot)ca

> -----Original Message-----
> From: Joshua Tolley [mailto:eggyknap(at)gmail(dot)com]
> Sent: November 1, 2008 3:42 PM
> To: Lawrence, Ramon
> Cc: pgsql-hackers(at)postgresql(dot)org; Bryce Cutt
> Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi-
> Batch Hash Join for Skewed Data Sets
>
> On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon
> <ramon(dot)lawrence(at)ubc(dot)ca> wrote:
> > We propose a patch that improves hybrid hash join's performance for
> > large multi-batch joins where the probe relation has skew.
> >
> > Project name: Histojoin
> > Patch file: histojoin_v1.patch
> >
> > This patch implements the Histojoin join algorithm as an optional
> > feature added to the standard Hybrid Hash Join (HHJ). A flag is used
> > to enable or disable the Histojoin features. When Histojoin is
> > disabled, HHJ acts as normal. The Histojoin features allow HHJ to use
> > PostgreSQL's statistics to do skew aware partitioning. The basic idea
> > is to keep build relation tuples in a small in-memory hash table that
> > have join values that are frequently occurring in the probe relation.
> > This improves performance of HHJ when multiple batches are used by
> > 10% to 50% for skewed data sets. The performance improvements of this
> > patch can be seen in the paper (pages 25-30) at:
> >
> > http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
> >
> > All generators and materials needed to verify these results can be
> > provided.
> >
> > This is a patch against the HEAD of the repository.
> >
> > This patch does not contain platform specific code. It compiles and
> > has been tested on our machines in both Windows (MSVC++) and Linux
> > (GCC).
> >
> > Currently the Histojoin feature is enabled by default and is used
> > whenever HHJ is used and there are Most Common Value (MCV) statistics
> > available on the probe side base relation of the join. To disable
> > this feature simply set the enable_hashjoin_usestatmcvs flag to off
> > in the database configuration file or at run time with the 'set'
> > command.
> >
> > One potential improvement not included in the patch is that Most
> > Common Value (MCV) statistics are only determined when the probe
> > relation is produced by a scan operator. There is a benefit to using
> > MCVs even when the probe relation is not a base scan, but we were
> > unable to determine how to find statistics from a base relation after
> > other operators are performed.
> >
> > This patch was created by Bryce Cutt as part of his work on his M.Sc.
> > thesis.
> >
> > --
> > Dr. Ramon Lawrence
> > Assistant Professor, Department of Computer Science, University of
> > British Columbia Okanagan
> > E-mail: ramon(dot)lawrence(at)ubc(dot)ca
>
> I'm interested in trying to review this patch. Having not done patch
> review before, I can't exactly promise grand results, but could you
> provide me with the data to check your results? In the meantime I'll
> go read the paper.
>
> - Josh / eggyknap
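
As a rough illustration of the skew-aware partitioning described in the quoted proposal, here is a minimal Python sketch. It is not code from the patch and does not reflect PostgreSQL's executor. It assumes the set of most common probe-side join values is already known, and it simplifies hybrid hash join by treating every non-MCV tuple as deferred to a later batch (standing in for tuples that would be written to disk). All names are hypothetical.

```python
from collections import defaultdict


def skew_aware_hash_join(build, probe, mcv_keys, n_batches):
    """Join two lists of (key, payload) tuples on key.

    Build tuples whose key is in mcv_keys go into a small in-memory
    "skew" table, so matching probe tuples are joined immediately.
    All other tuples are partitioned into n_batches by hash.
    Returns (results, deferred), where deferred counts the probe tuples
    that had to wait for a later batch.
    """
    mcv_keys = set(mcv_keys)
    skew_table = defaultdict(list)
    build_batches = [defaultdict(list) for _ in range(n_batches)]

    # Build phase: route frequent keys to the skew table, the rest to batches.
    for key, payload in build:
        if key in mcv_keys:
            skew_table[key].append(payload)
        else:
            build_batches[hash(key) % n_batches][key].append(payload)

    results = []
    probe_batches = [[] for _ in range(n_batches)]
    deferred = 0

    # Probe phase: tuples with frequent keys are joined right away.
    for key, payload in probe:
        if key in mcv_keys:
            for build_payload in skew_table.get(key, ()):
                results.append((key, build_payload, payload))
        else:
            probe_batches[hash(key) % n_batches].append((key, payload))
            deferred += 1

    # Remaining batches are joined one at a time, as in a plain hash join.
    for batch_no in range(n_batches):
        table = build_batches[batch_no]
        for key, payload in probe_batches[batch_no]:
            for build_payload in table.get(key, ()):
                results.append((key, build_payload, payload))

    return results, deferred
```

The more probe tuples share a key held in the skew table, the fewer are deferred, which is where the benefit for skewed data would come from under these assumptions.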

In response to

Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets at 2008-11-01 22:41:48 from Joshua Tolley

Responses

Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets at 2008-11-03 00:41:55 from Joshua Tolley
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets at 2008-11-03 01:36:24 from Tom Lane
