From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Dmitry Lazurkin <dilaz03(at)gmail(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Perfomance of IN-clause with many elements and possible solutions
Date: 2017-07-25 03:11:09
Message-ID: 466.1500952269@sss.pgh.pa.us
Lists: pgsql-general

"David G. Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com> writes:
> On Mon, Jul 24, 2017 at 3:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The cost to form the inner hash is basically negligible whether it's
>> de-duped or not, but if it's not (known) de-duped then the cost
>> estimate for the semijoin is going to rise some, and that discourages
>> selecting it.

> Why does the "hash semi join" care about duplication of values on the
> inner relation? Doesn't it only care whether a given bucket exists
> irrespective of its contents?

No, it cares about the average bucket size, ie number of entries that
a probe will have to look at. The non-de-duped inner relation can't
be assumed to have a flat distribution among the buckets.
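
To put rough numbers on that (a back-of-the-envelope sketch of my own,
not planner code): if the probe keys follow the same distribution as
the inner keys, a probe landing in bucket i scans about f[i] * N
entries, so the expected entries examined per probe is N * sum(f[i]^2),
which skew inflates well past N/nbuckets:

/*
 * Toy illustration (not planner code).  f[i] is the fraction of the
 * inner relation's tuples in bucket i; with probe keys drawn from the
 * same distribution, expected entries scanned per probe is
 * ntuples * sum(f[i]^2).
 */
#include <stdio.h>

static double
expected_scanned(const double *f, int nbuckets, double ntuples)
{
    double sum = 0.0;

    for (int i = 0; i < nbuckets; i++)
        sum += f[i] * f[i];
    return ntuples * sum;
}

int
main(void)
{
    double flat[4]   = {0.25, 0.25, 0.25, 0.25};    /* even spread */
    double skewed[4] = {0.70, 0.10, 0.10, 0.10};    /* one hot bucket */

    printf("flat:   %.0f entries/probe\n", expected_scanned(flat, 4, 1000));
    printf("skewed: %.0f entries/probe\n", expected_scanned(skewed, 4, 1000));
    /* prints 250 vs 520: same data volume, roughly double the probe cost */
    return 0;
}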

> Pointing me to the readme or code file (comments) that explains this in
> more detail would be welcome.

Look for estimate_hash_bucketsize in selfuncs.c and its caller in
costsize.c. The key point is this argument in that function's
header comment:
* We are passed the number of buckets the executor will use for the given
* input relation. If the data were perfectly distributed, with the same
* number of tuples going into each available bucket, then the bucketsize
* fraction would be 1/nbuckets. But this happy state of affairs will occur
* only if (a) there are at least nbuckets distinct data values, and (b)
* we have a not-too-skewed data distribution. Otherwise the buckets will
* be nonuniformly occupied. If the other relation in the join has a key
* distribution similar to this one's, then the most-loaded buckets are
* exactly those that will be probed most often. Therefore, the "average"
* bucket size for costing purposes should really be taken as something close
* to the "worst case" bucket size. We try to estimate this by adjusting the
* fraction if there are too few distinct data values, and then scaling up
* by the ratio of the most common value's frequency to the average frequency.
*
* If no statistics are available, use a default estimate of 0.1. This will
* discourage use of a hash rather strongly if the inner relation is large,
* which is what we want. We do not want to hash unless we know that the
* inner rel is well-dispersed (or the alternatives seem much worse).
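
Reading that comment as pseudocode, the calculation has roughly this
shape (my paraphrase only; the variable names are mine, and the real
estimate_hash_bucketsize does considerably more):

#include <stdio.h>

/* Rough paraphrase of the heuristic described above; not the real code. */
static double
bucketsize_fraction(double ndistinct,   /* estimated distinct inner values */
                    double nbuckets,    /* buckets the executor will use */
                    double mcv_freq,    /* frequency of most common value */
                    int have_stats)
{
    double avg_freq = 1.0 / ndistinct;
    double estfract;

    if (!have_stats)
        return 0.1;             /* pessimistic default; discourages hashing */

    /* perfectly flat case: 1/nbuckets of the inner rel per probe */
    estfract = 1.0 / nbuckets;

    /* (a) fewer distinct values than buckets: buckets get doubled up */
    if (ndistinct < nbuckets)
        estfract = 1.0 / ndistinct;

    /* (b) skew: scale toward worst case by MCV frequency vs. average */
    if (mcv_freq > avg_freq)
        estfract *= mcv_freq / avg_freq;

    return estfract;
}

int
main(void)
{
    /* a million buckets, but only 10 distinct values, commonest = 50% */
    printf("%.3f\n", bucketsize_fraction(10, 1024 * 1024, 0.5, 1));
    /* prints 0.500: half the inner rel scanned on an average probe */
    return 0;
}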

That's certainly a bit pessimistic, but the thing to remember about any
hashing algorithm is that occasionally it will eat your lunch. We prefer
to go to sort-based algorithms if there seems to be a risk of the hash
degenerating to O(N^2).
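
(To put a toy number on "eat your lunch": with N inner tuples all
landing in one bucket, each of N probes scans all N entries, versus
roughly N log N comparisons to sort instead; illustrative arithmetic
of mine, not a planner estimate.)

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double N = 1e6;

    printf("hash, well dispersed: ~%.1e comparisons\n", N);
    printf("hash, one big bucket: ~%.1e comparisons\n", N * N);
    printf("sort-based instead:   ~%.1e comparisons\n", N * log2(N));
    return 0;
}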

			regards, tom lane