From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: HashJoin w/option to unique-ify inner rel
Date: 2009-04-26 04:14:29
Message-ID: 603c8f070904252114m7a330f5fice7fadd3b214d43d@mail.gmail.com
Lists: pgsql-hackers
On Fri, Apr 24, 2009 at 10:49 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> As far as I can tell, the focus on trying to estimate the number of
>> tuples per bucket is entirely misguided. Supposing the relation is
>> mostly unique so that the values don't cluster too much, the right
>> answer is (of course) NTUP_PER_BUCKET.
>
> But the entire point of that code is to arrive at a sane estimate
> when the inner relation *isn't* mostly unique and *does* cluster.
> So I think you're being much too hasty to conclude that it's wrong.
A few further thoughts on this... the patch you attached computes
jselec (percentage of tuples with a match) and avgmatch (number of
matches per tuple with at least one match). This seems like a really
good idea, but perhaps we should do it even when performing an
inner/left join, if that's possible.
Regardless of the join type, when there is no match, we need to scan
the entire bucket. There's no reason to assume we'll hit more
populated buckets more often. So, if there are 2^b virtual buckets
and n inner tuples, then the bucket we hit will on average contain
n/(2^b) tuples and the chances of a hash collision for any given tuple
in that bucket are about 2^-(32-b) multiplied by whatever fudge factor
you want to allow for imperfect hashing (call it f). The total number
of hash collisions per unmatched probe tuple is then the product of the
number of tuples in the bucket and the chance of a collision per tuple,
or in other words n/(2^b) * 2^-(32-b) * f = n * 2^-32 * f.
For an inner/left join, when there is a match, we still need to scan
the entire bucket, but now we know we'll have to evaluate the hash
quals at least avgmatch times. In addition, we risk having to
evaluate the hash quals for each tuple in the bucket that doesn't
match, if the hash values happen to collide. In this case we can only
have hash collisions with the tuples we're not matching, so the number
of hash qual evaluations should be about (n - avgmatch) * 2^-32 * f +
avgmatch.
For a semi/anti join, when there is a match, we only need to scan
until we hit the first matching tuple. If we let the number of
collisions, c, be (n - avgmatch) * 2^-32 * f, then the percentage of
the bucket we need to scan is about c / (c + avgmatch). Since they
might not be randomly distributed within the bucket, multiply by 2 as
in your existing patch, but with a maximum of 1.0. So the number of
hash qual evaluations will be:
MIN(1.0, 2.0 * c / (c + avgmatch)) * (c + avgmatch) = MIN(c + avgmatch, 2.0 * c)
...Robert