Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Just handling better the case where we pick a straight nested loop
> rather than a hash join would help a lot of people. Some basic
> conservatism about the number of outer rows would be useful here (in
> particular, we should probably assume that there will be at least 2
> when costing a nest loop, unless the outer side is known unique),
> and it's also worth thinking about the fact that a hash join won't
> build the table unless there is at least 1 outer row, which I don't
> think the current costing algorithm takes into account. Or maybe we
> should estimate the amount by which the nested loop figures to beat
> out the hash join and only accept the nested loop plan if the win
> exceeds some threshold percentage.
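
To make sure I'm reading that proposal right, here's a toy sketch of
the two heuristics as I understand them (standalone C, not planner
code -- the helper names, costs, and the 10% margin are all invented
for illustration): clamp the outer-row estimate to at least 2 unless
the outer side is known unique, and only pick the nested loop when it
beats the hash join by some margin.

/*
 * Toy illustration (not PostgreSQL source) of the heuristics quoted
 * above.  All names and numbers are invented for the example.
 */
#include <stdbool.h>
#include <stdio.h>

/* Clamp the estimated outer row count used for nested-loop costing. */
static double
clamped_outer_rows(double est_outer_rows, bool outer_is_unique)
{
    if (!outer_is_unique && est_outer_rows < 2.0)
        return 2.0;             /* be conservative: assume a rescan */
    return est_outer_rows;
}

/* Prefer the nested loop only when it wins by more than 'margin'. */
static bool
prefer_nestloop(double nestloop_cost, double hashjoin_cost, double margin)
{
    return nestloop_cost < hashjoin_cost * (1.0 - margin);
}

int
main(void)
{
    double outer_rows = clamped_outer_rows(1.0, false);
    double nestloop_cost = 95.0 * outer_rows;   /* pretend per-outer-row cost */
    double hashjoin_cost = 210.0;               /* pretend total hash join cost */

    printf("outer rows used for costing: %.1f\n", outer_rows);
    printf("choose nested loop? %s\n",
           prefer_nestloop(nestloop_cost, hashjoin_cost, 0.10) ? "yes" : "no");
    return 0;
}
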
But in our environment the most common cause of a sub-optimal planning
choice is over-estimating the cost of a nested loop. We've never been
able to get good plans overall without dropping random_page_cost to
twice the seq_page_cost or less -- in highly cached situations we
routinely drop both to 0.1. Creating further bias against nested loop
plans just means we'll have to push the numbers further from what
they purportedly represent. It seems to me that a significant unsolved problem
is properly accounting for caching effects.
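
To illustrate why we end up lowering those settings, here's a
deliberately simplified comparison (standalone C, not the real cost
model; every row and page count is invented): pricing random reads at
4x a sequential read makes a nested loop that probes an index once per
outer row look far worse than a hash join's single sequential pass,
and the decision flips once both costs drop to 0.1 for a cached
workload.

/*
 * Simplified illustration (not PostgreSQL's actual cost model) of how
 * the page-cost settings steer the join choice.  Counts are invented.
 */
#include <stdio.h>

static void
compare(const char *label, double seq_page_cost, double random_page_cost)
{
    double outer_rows = 1000.0;       /* probes the inner index once per row */
    double pages_per_probe = 3.0;     /* random pages touched by each probe  */
    double inner_seq_pages = 5000.0;  /* one sequential pass for the hash join */

    double nestloop_pages_cost = outer_rows * pages_per_probe * random_page_cost;
    double hashjoin_pages_cost = inner_seq_pages * seq_page_cost;

    printf("%s: nested loop %.0f vs hash join %.0f -> picks %s\n",
           label, nestloop_pages_cost, hashjoin_pages_cost,
           nestloop_pages_cost < hashjoin_pages_cost ? "nested loop" : "hash join");
}

int
main(void)
{
    compare("default (seq=1.0, random=4.0)", 1.0, 4.0);
    compare("cached  (seq=0.1, random=0.1)", 0.1, 0.1);
    return 0;
}
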
That said, I have no really clear idea of how best to solve that
problem. The only ideas which recur when I face these issues are:
(1) The planner refuses to deal with fractional estimates of how many
rows will be returned in a loop -- it treats 0.01 as 1, on the basis
that you can't read a fractional row, rather than as a 1% chance that
you will read a row and need to do the related work. I have always
thought that changing this might allow more accurate estimates;
perhaps I should hack a version which behaves that way and test it as
a "proof of concept" (a toy sketch of the difference follows after
point (2) below). Note that this is diametrically opposed to your
suggestion that we always assume at least two rows in the absence of a
unique index.
(2) Somehow use effective_cache_size in combination with some sort of
current activity metrics to dynamically adjust random access costs.
(I know, that one's total hand-waving, but it seems to have some
possibility of better modeling reality than what we currently do.)
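
For point (1), here's the kind of difference I mean, as a toy sketch
(standalone C, not planner code; the costs are invented): clamping a
0.01-row estimate up to 1 charges the full per-row work on every outer
iteration, while treating it as a probability charges only the
expected work.

/*
 * Toy sketch of point (1), not actual planner code.  If the expected
 * number of inner rows per outer row is 0.01, clamping it to 1 charges
 * the full per-row work on every iteration; treating it as a 1% chance
 * charges only the expected work.  All costs are invented.
 */
#include <stdio.h>

/* Cost of the inner side across all outer iterations. */
static double
inner_cost(double outer_rows, double rows_per_probe, double per_row_cost,
           int clamp_to_one)
{
    double effective_rows = rows_per_probe;

    if (clamp_to_one && effective_rows < 1.0)
        effective_rows = 1.0;   /* current behavior: "you can't read 0.01 rows" */

    return outer_rows * effective_rows * per_row_cost;
}

int
main(void)
{
    double outer_rows = 10000.0;
    double rows_per_probe = 0.01;   /* 1% chance a probe finds a matching row */
    double per_row_cost = 5.0;

    printf("clamped to 1 row:       %.0f\n",
           inner_cost(outer_rows, rows_per_probe, per_row_cost, 1));
    printf("treated as probability: %.0f\n",
           inner_cost(outer_rows, rows_per_probe, per_row_cost, 0));
    return 0;
}
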
-Kevin