From: | Bryce Cutt <pandasuit(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets |
Date: | 2009-03-21 00:45:14 |
Message-ID: | 1924d1180903201745i7e3f6876w2e108571e3b1843d@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Not necessarily true. Seeing as (when the statistics are correct) we
know each of these inner tuples will match with the largest amount of
outer tuples it is just as much of a win per inner tuple as when they
are unique. There is just a chance you will have to give up on the
optimization part way through if too many inner tuples fall into the
new "skew buckets" (formerly IM buckets) and dump the tuples back into
the main buckets. The potential win is still pretty high though.
- Bryce Cutt
On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Bryce Cutt <pandasuit(at)gmail(dot)com> writes:
>>> Here is the new patch.
>>
>> Applied with revisions. I undid some of the "optimizations" that
>> cluttered the code in order to save a cycle or two per tuple --- as per
>> previous discussion, that's not what the performance questions were
>> about. Also, I did not like the terminology "in-memory"/"IM"; it seemed
>> confusing since the main hash table is in-memory too. I revised the
>> code to consistently refer to the additional hash table as a "skew"
>> hashtable and the optimization in general as skew optimization. Hope
>> that seems reasonable to you --- we could search-and-replace it to
>> something else if you'd prefer.
>>
>> For the moment, I didn't really do anything about teaching the planner
>> to account for this optimization in its cost estimates. The initial
>> estimate of the number of MCVs that will be specially treated seems to
>> me to be too high (it's only accurate if the inner relation is unique),
>> but getting a more accurate estimate seems pretty hard, and it's not
>> clear it's worth the trouble. Without that, though, you can't tell
>> what fraction of outer tuples will get the short-circuit treatment.
>
> If the inner relation isn't fairly close to unique you shouldn't be
> using this optimization in the first place.
>
> ...Robert
>
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2009-03-21 00:51:00 | Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets |
Previous Message | Robert Haas | 2009-03-21 00:35:53 | Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets |