Re: Perfomance of IN-clause with many elements and possible solutions

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Dmitry Lazurkin <dilaz03(at)gmail(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Perfomance of IN-clause with many elements and possible solutions
Date: 2017-07-25 03:11:09
Message-ID: 466.1500952269@sss.pgh.pa.us
Lists: pgsql-general

"David G. Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com> writes:
> On Mon, Jul 24, 2017 at 3:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The cost to form the inner hash is basically negligible whether it's
>> de-duped or not, but if it's not (known) de-duped then the cost
>> estimate for the semijoin is going to rise some, and that discourages
>> selecting it.

> ​Why does the "hash semi join" care about duplication of values on the
> inner relation? Doesn't it only care whether a given bucket exists
> irrespective of its contents?

No, it cares about the average bucket size, i.e., the number of entries
that a probe will have to look at. The non-de-duped inner relation can't
be assumed to have a flat distribution among the buckets.
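[Editor's note: a minimal sketch, not PostgreSQL code, of why duplicates on the inner side matter. With a chained hash table, each probe scans every entry in its bucket, so a non-de-duped, skewed inner relation makes the popular buckets expensive to probe. The function name and data are illustrative only.]

```python
from collections import Counter

def avg_probe_cost(inner, outer, nbuckets):
    """Average number of entries a probe must scan in a chained hash
    table built over `inner`. Illustrative sketch, not selfuncs.c."""
    # Bucket occupancy: how many inner entries land in each bucket.
    buckets = Counter(hash(v) % nbuckets for v in inner)
    # Each probe scans its whole bucket chain.
    scans = sum(buckets.get(hash(v) % nbuckets, 0) for v in outer)
    return scans / len(outer)

# De-duplicated inner relation: at most one entry per value.
deduped = list(range(1000))
# Non-de-duped, skewed inner relation: one value repeated heavily.
skewed = list(range(1000)) + [42] * 9000

outer = [42] * 100  # probes that hit the popular value
print(avg_probe_cost(deduped, outer, 1024))  # 1.0
print(avg_probe_cost(skewed, outer, 1024))   # 9001.0
```

The bucket *existing* is not the issue; the 9000 duplicates chained inside it are what each probe has to walk.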

> Pointing me to the readme or code file (comments) that explains this in
> more detail would be welcome.

Look for estimate_hash_bucketsize in selfuncs.c and its caller in
costsize.c. The key point is this argument in that function's
header comment:

* We are passed the number of buckets the executor will use for the given
* input relation. If the data were perfectly distributed, with the same
* number of tuples going into each available bucket, then the bucketsize
* fraction would be 1/nbuckets. But this happy state of affairs will occur
* only if (a) there are at least nbuckets distinct data values, and (b)
* we have a not-too-skewed data distribution. Otherwise the buckets will
* be nonuniformly occupied. If the other relation in the join has a key
* distribution similar to this one's, then the most-loaded buckets are
* exactly those that will be probed most often. Therefore, the "average"
* bucket size for costing purposes should really be taken as something close
* to the "worst case" bucket size. We try to estimate this by adjusting the
* fraction if there are too few distinct data values, and then scaling up
* by the ratio of the most common value's frequency to the average frequency.
*
* If no statistics are available, use a default estimate of 0.1. This will
* discourage use of a hash rather strongly if the inner relation is large,
* which is what we want. We do not want to hash unless we know that the
* inner rel is well-dispersed (or the alternatives seem much worse).
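[Editor's note: the adjustment described in that comment can be sketched roughly as follows. This is a simplified illustration, not the actual estimate_hash_bucketsize code; the function name and numbers are hypothetical.]

```python
def bucketsize_fraction(ntuples, ndistinct, nbuckets, mcv_freq):
    """Rough sketch of the estimated fraction of inner tuples a probe
    must scan. Simplified from the comment above, not selfuncs.c."""
    if ntuples <= 0:
        return 0.1  # no statistics: strongly discourage hashing
    # Fewer distinct values than buckets means some buckets go unused.
    est = 1.0 / min(ndistinct, nbuckets)
    # Scale up by the most common value's frequency relative to the
    # average frequency, approximating the worst-case bucket.
    avg_freq = 1.0 / ndistinct
    if mcv_freq > avg_freq:
        est *= mcv_freq / avg_freq
    return min(est, 1.0)

# Well-dispersed inner rel: close to the ideal 1/nbuckets.
print(bucketsize_fraction(10000, 5000, 1024, 0.0002))
# Heavy skew (one value is 20% of the rel): near worst case.
print(bucketsize_fraction(10000, 5000, 1024, 0.2))
```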

That's certainly a bit pessimistic, but the thing to remember about any
hashing algorithm is that occasionally it will eat your lunch. We prefer
to go to sort-based algorithms if there seems to be a risk of the hash
degenerating to O(N^2).
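[Editor's note: a back-of-the-envelope illustration of that worst case, with made-up numbers. If every inner value collides into one bucket, each of N probes scans all N entries, versus roughly N log2 N comparisons for a sort-based plan.]

```python
import math

n = 100_000
# Degenerate hash: every probe walks a single N-entry bucket chain.
hash_worst = n * n
# Sort-based plan stays near N*log2(N) comparisons regardless of skew.
sort_cost = int(n * math.log2(n))
print(hash_worst // sort_cost)  # hash worst case is thousands of times costlier
```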

regards, tom lane
