From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: HashJoin w/option to unique-ify inner rel
Date: 2009-04-25 02:49:23
Message-ID: 24756.1240627763@sss.pgh.pa.us
Lists: pgsql-hackers
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> As far as I can tell, the focus on trying to estimate the number of
> tuples per bucket is entirely misguided. Supposing the relation is
> mostly unique so that the values don't cluster too much, the right
> answer is (of course) NTUP_PER_BUCKET.
But the entire point of that code is to arrive at a sane estimate
when the inner relation *isn't* mostly unique and *does* cluster.
So I think you're being much too hasty to conclude that it's wrong.
> Because the extra tuples that get thrown into the bucket
> generally don't have the same hash value (or if they did, they would
> have been in the bucket either way...) and get rejected with a simple
> integer comparison, which is much cheaper than
> hash_qual_cost.per_tuple.
Yeah, we are charging more than we ought to for bucket entries that can
be rejected on the basis of hashcode comparisons. The difficulty is to
arrive at a reasonable guess of what fraction of the bucket entries will
be so rejected, versus those that will incur a comparison-function call.
I'm leery of assuming there are no hash collisions, which is what you
seem to be proposing.
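
For concreteness, here is a minimal sketch of the accounting under
discussion (hypothetical names throughout; this is not the actual
cost_hashjoin()/costsize.c code). collision_frac stands for the
fraction of bucket entries whose stored hash value matches the probe
tuple's; estimating that fraction is exactly the hard part:

    #include <stdio.h>

    /*
     * Hypothetical sketch: split the expected per-probe charge for
     * scanning one hash bucket into cheap hashcode rejections and
     * expensive hash-qual (comparison-function) evaluations.
     */
    static double
    bucket_scan_cost(double bucket_entries, /* expected tuples per bucket */
                     double collision_frac, /* share with matching hashvalue */
                     double qual_cost,      /* comparison-function call cost */
                     double int_cmp_cost)   /* one integer hashcode compare */
    {
        double matches = bucket_entries * collision_frac;
        double rejects = bucket_entries - matches;

        /*
         * Every entry costs at least an integer hashcode comparison;
         * only entries whose hash values collide with the probe's go
         * on to run the join qual.
         */
        return rejects * int_cmp_cost + matches * qual_cost;
    }

    int
    main(void)
    {
        /* e.g. 10 entries per bucket, 10% hashvalue collisions, and a
         * qual evaluation 100x the price of an integer compare */
        printf("expected cost per probe: %g\n",
               bucket_scan_cost(10.0, 0.10, 0.01, 0.0001));
        return 0;
    }

In these terms, charging hash_qual_cost.per_tuple for every bucket
entry amounts to collision_frac = 1, while assuming rejection by
hashcode alone (beyond the true matches) puts collision_frac near
zero; the disagreement above is over which end of that range is the
safer default.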
regards, tom lane