From: Dennis Haney <davh(at)diku(dot)dk>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Hash cost
Date: 2004-03-30 13:06:21
Message-ID: 406970CD.2060808@diku.dk
Lists: pgsql-hackers
Hi
Could someone please explain why the cost estimator for hash joins
(cost_hashjoin in costsize.c) is implemented the way it is?
Especially these issues:
First, there is the estimate of the number of rows and their width.
ExecChooseHashTableSize() apparently trusts neither and doubles both,
so the function treats the inner relation as four times larger than the
rest of the optimizer thinks it is. Why is that?
And why is this doubling also applied to the size of HashJoinTupleData,
yet not applied twice to the estimated total size of the hash table in
bytes, the number used to calculate the number of batches?
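
To make the first point concrete, here is roughly the pattern as I read
it; this is my own paraphrase with made-up names and illustrative
numbers, not the actual source, and the 2.0 is the fudge factor I
believe is being applied:

    #include <stdio.h>

    #define FUDGE 2.0                   /* the doubling I am asking about */

    int main(void)
    {
        double ntuples  = 100000.0;     /* planner's row estimate (illustrative) */
        double tupwidth = 40.0;         /* planner's width estimate (illustrative) */

        /* the row count and the tuple size each get doubled ... */
        double padded_rows  = ntuples  * FUDGE;
        double padded_width = tupwidth * FUDGE;

        /* ... so the hash sizing works from 4x what the planner believes */
        printf("planner thinks:   %.0f bytes\n", ntuples * tupwidth);
        printf("hash sizing uses: %.0f bytes\n", padded_rows * padded_width);
        return 0;
    }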
Second, why does the optimizer first guess that there will be 10 tuples
in a bucket, and then afterwards spend a lot of effort estimating that
very number for use in another calculation, using figures that were
themselves based on the 10-per-bucket guess?
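
Paraphrasing how I read that flow (names and numbers are illustrative,
not the real code): the bucket count is first derived from a flat
10-tuples-per-bucket assumption, and only then is the expected bucket
load estimated carefully, from a bucket count that already embeds the
guess:

    #include <stdio.h>
    #include <math.h>

    #define NTUP_PER_BUCKET 10          /* the flat guess */

    int main(void)
    {
        double ntuples   = 100000.0;    /* inner row estimate (illustrative) */
        double ndistinct = 5000.0;      /* distinct hash keys (illustrative) */

        /* step 1: size the table from the flat 10-per-bucket guess */
        double nbuckets = ceil(ntuples / NTUP_PER_BUCKET);

        /* step 2: then estimate the expected bucket load anyway, using the
         * bucket count that step 1 derived from that very guess */
        double load = ntuples / (ndistinct < nbuckets ? ndistinct : nbuckets);

        printf("buckets from the flat guess:      %.0f\n", nbuckets);
        printf("then-estimated tuples per bucket: %.2f\n", load);
        return 0;
    }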
Third, why does the calculation assume that the most common value
dominates by far, while the other common values count for nothing?
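
As far as I can tell, the skew adjustment looks roughly like this
(again my paraphrase, not the actual selfuncs.c source): only the
single most common value's frequency is consulted, so a second value
that is almost as common changes nothing:

    #include <stdio.h>

    /* my paraphrase of the adjustment: scale the bucket-size fraction by
     * how much the top MCV exceeds the average frequency */
    static double adjust_for_skew(double estfract, double avgfreq, double mcvfreq)
    {
        if (avgfreq > 0.0 && mcvfreq > avgfreq)
            estfract *= mcvfreq / avgfreq;      /* top MCV only */
        return estfract;
    }

    int main(void)
    {
        double estfract = 0.001;    /* base bucket-size fraction (illustrative) */
        double avgfreq  = 0.001;    /* roughly 1/ndistinct (illustrative) */
        double mcv1     = 0.02;     /* most common value's frequency */

        /* a second value at frequency 0.019 never enters the calculation */
        printf("adjusted fraction: %g\n", adjust_for_skew(estfract, avgfreq, mcv1));
        return 0;
    }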
Fourth, why is it assumed that the hash function creates no collisions
between non-identical values, and that having multiple join qualifiers
does not affect this either?
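
A toy calculation of what I mean about collisions, with made-up
numbers: as soon as there are more distinct key values than buckets,
every chain mixes several different values, yet the costing seems to
count only the tuples of the matching value:

    #include <stdio.h>

    int main(void)
    {
        double ndistinct = 50000.0;     /* distinct join key values (illustrative) */
        double nbuckets  = 10000.0;     /* buckets in the hash table (illustrative) */

        /* with a uniform hash, each chain mixes about this many different
         * key values, so a probe also compares against non-matching tuples */
        printf("expected distinct values per bucket: %.1f\n",
               ndistinct / nbuckets);
        return 0;
    }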
Fifth, why is it assumed that a probe usually lands in a chain of
average length? I would expect that far more time is spent searching
the chains with the most tuples...
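
A small example of why the average seems wrong to me, using made-up
chain lengths: if the outer keys are distributed like the inner ones, a
probe lands in a chain with probability proportional to that chain's
length, so the chain a probe actually walks is much longer than the
plain average:

    #include <stdio.h>

    int main(void)
    {
        /* illustrative chain lengths for a skewed hash table */
        double chains[] = {1, 1, 1, 1, 1, 1, 1, 1, 2, 90};
        int    n = (int) (sizeof(chains) / sizeof(chains[0]));
        double sum = 0.0, sumsq = 0.0;
        int    i;

        for (i = 0; i < n; i++)
        {
            sum   += chains[i];
            sumsq += chains[i] * chains[i];
        }

        /* what the model seems to charge: the plain average chain length */
        printf("average chain length:        %.2f\n", sum / n);

        /* what a probe for a random inner tuple's key actually walks:
         * long chains are hit in proportion to their length */
        printf("probe-weighted chain length: %.2f\n", sumsq / sum);
        return 0;
    }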
--
Dennis