From: | "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca> |
---|---|
To: | "Joshua Tolley" <eggyknap(at)gmail(dot)com> |
Cc: | <pgsql-hackers(at)postgresql(dot)org>, "Bryce Cutt" <pandasuit(at)gmail(dot)com> |
Subject: | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets |
Date: | 2008-11-02 23:48:36 |
Message-ID: | 6EEA43D22289484890D119821101B1DF2C16BC@exchange20.mercury.ad.ubc.ca |
Lists: | pgsql-hackers |
Joshua,
Thank you for offering to review the patch.
The easiest way to test would be to generate your own TPC-H data and
load it into a database for testing. I have posted the TPC-H generator
at:
http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip
The generator can produce skewed data sets. It was produced by
Microsoft Research.
After unzipping, on a Windows machine, you can just run the command:
dbgen -s 1 -z 1
This will produce a TPC-H database at scale factor 1 (about 1 GB) with a
Zipfian skew of z=1. More information on the generator is in the document
README-S.DOC.
Source is provided for the generator, so you should be able to run it on
other operating systems as well.
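As a rough illustration of what a Zipfian skew of z=1 means for a join
attribute, the share of rows held by the value of rank k can be taken as
proportional to 1/k^z. This is only a sketch and not the generator's code;
the function name and the choice of 1000 distinct values are assumptions
made for the example.

```python
def zipf_shares(n_values, z=1.0):
    """Return the fraction of rows carried by each value rank under Zipf(z).

    Rank 1 is the most common value. With z=0 the distribution is uniform.
    """
    weights = [1.0 / (rank ** z) for rank in range(1, n_values + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical column with 1000 distinct values and skew z=1.
    shares = zipf_shares(1000, z=1.0)
    print(f"most common value: {shares[0]:.1%} of rows")
    print(f"top 10 values:     {sum(shares[:10]):.1%} of rows")
```

With z=1 a handful of values account for a large share of the rows, which
is the situation the patch is meant to exploit.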
The schema DDL is at:
http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt
Note that the load time for 1 GB of data is 1-2 hours and for 10 GB of
data is about 24 hours. I recommend you do not add the foreign keys until
after the data is loaded.
The other alternative is for us to run pg_dump on our data sets. However,
the download size would be quite large, and it will take a couple of days
for us to get you the data in that form.
--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon(dot)lawrence(at)ubc(dot)ca
> -----Original Message-----
> From: Joshua Tolley [mailto:eggyknap(at)gmail(dot)com]
> Sent: November 1, 2008 3:42 PM
> To: Lawrence, Ramon
> Cc: pgsql-hackers(at)postgresql(dot)org; Bryce Cutt
> Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
>
> On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <ramon(dot)lawrence(at)ubc(dot)ca> wrote:
> > We propose a patch that improves hybrid hash join's performance for large
> > multi-batch joins where the probe relation has skew.
> >
> > Project name: Histojoin
> > Patch file: histojoin_v1.patch
> >
> > This patch implements the Histojoin join algorithm as an optional feature
> > added to the standard Hybrid Hash Join (HHJ). A flag is used to enable or
> > disable the Histojoin features. When Histojoin is disabled, HHJ acts as
> > normal. The Histojoin features allow HHJ to use PostgreSQL's statistics to
> > do skew aware partitioning. The basic idea is to keep build relation tuples
> > in a small in-memory hash table that have join values that are frequently
> > occurring in the probe relation. This improves performance of HHJ when
> > multiple batches are used by 10% to 50% for skewed data sets. The
> > performance improvements of this patch can be seen in the paper (pages
> > 25-30) at:
> >
> > http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
> >
> > All generators and materials needed to verify these results can be provided.
> >
> > This is a patch against the HEAD of the repository.
> >
> > This patch does not contain platform specific code. It compiles and has
> > been tested on our machines in both Windows (MSVC++) and Linux (GCC).
> >
> > Currently the Histojoin feature is enabled by default and is used whenever
> > HHJ is used and there are Most Common Value (MCV) statistics available on
> > the probe side base relation of the join. To disable this feature simply
> > set the enable_hashjoin_usestatmcvs flag to off in the database
> > configuration file or at run time with the 'set' command.
> >
> > One potential improvement not included in the patch is that Most Common
> > Value (MCV) statistics are only determined when the probe relation is
> > produced by a scan operator. There is a benefit to using MCVs even when the
> > probe relation is not a base scan, but we were unable to determine how to
> > find statistics from a base relation after other operators are performed.
> >
> > This patch was created by Bryce Cutt as part of his work on his M.Sc.
> > thesis.
> >
> > --
> > Dr. Ramon Lawrence
> > Assistant Professor, Department of Computer Science, University of British
> > Columbia Okanagan
> > E-mail: ramon(dot)lawrence(at)ubc(dot)ca
>
> I'm interested in trying to review this patch. Having not done patch
> review before, I can't exactly promise grand results, but if you could
> provide me with the data to check your results? In the meantime I'll
> go read the paper.
>
> - Josh / eggyknap
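
For anyone reading the thread without the paper at hand, the skew-aware
partitioning described in the quoted proposal can be sketched roughly as
follows. This is a toy model under stated assumptions, not the patch's
code: the function name, the tuple layout, and the batch count are
hypothetical, every regular batch is treated as spilled (real HHJ keeps
the first batch in memory), and the MCV set is passed in by the caller
rather than read from PostgreSQL's statistics.

```python
from collections import defaultdict


def skew_aware_join(build, probe, mcvs, n_batches=4):
    """Inner-join (key, payload) rows, keeping MCV build rows in memory.

    build, probe: lists of (key, payload) tuples.
    mcvs: set of join keys assumed to be frequent in the probe relation.
    Returns (results, spilled_probe_rows).
    """
    hot = defaultdict(list)  # small in-memory table for MCV build tuples
    build_batches = [[] for _ in range(n_batches)]
    for key, payload in build:
        if key in mcvs:
            hot[key].append(payload)
        else:
            build_batches[hash(key) % n_batches].append((key, payload))

    results = []
    probe_batches = [[] for _ in range(n_batches)]
    spilled = 0
    for key, payload in probe:
        if key in mcvs:
            # Frequent probe values are joined at once and never written out.
            for build_payload in hot.get(key, ()):
                results.append((key, build_payload, payload))
        else:
            probe_batches[hash(key) % n_batches].append((key, payload))
            spilled += 1

    # Remaining batches are joined one at a time, as in a regular hash join.
    for build_batch, probe_batch in zip(build_batches, probe_batches):
        table = defaultdict(list)
        for key, payload in build_batch:
            table[key].append(payload)
        for key, payload in probe_batch:
            for build_payload in table.get(key, ()):
                results.append((key, build_payload, payload))
    return results, spilled


if __name__ == "__main__":
    build = [(k, f"b{k}") for k in range(10)]
    # Key 0 dominates the probe side, standing in for a skewed column.
    probe = [(0, "p")] * 50 + [(k, "p") for k in range(1, 10)]
    rows, spilled = skew_aware_join(build, probe, mcvs={0})
    print(len(rows), "result rows;", spilled, "probe rows deferred to batches")
```

In this toy model the saving comes from the probe tuples that match an MCV
and therefore never need to be written to a batch file and read back.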