Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Nathan Boley" <npboley(at)gmail(dot)com>
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Zeugswetter Andreas OSB sIT" <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
Date: 2008-06-10 23:03:43
Message-ID: 26169.1213139023@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Nathan Boley" <npboley(at)gmail(dot)com> writes:
>> (1) On what grounds do you assert the above?

> For a table with 1000000 non-mcv rows, the planner estimates a result
> set of cardinality 1000000/20 = 50000, not 1.

The real problem in that situation is that you need another twenty slots
in the MCV list. The MCV list should *always* exhaust the set of values
for which it'd be bad to do an indexscan. Assuming that the threshold
for switching to an indexscan is somewhere around selectivity 0.005
(I am not certain offhand, but it's in that general area), this cannot
possibly require more than 200 MCV slots, and for most data
distributions it'd be a whole lot less.

Given such an MCV list, the planner will always make the right choice
of whether to do index or seqscan ... as long as it knows the value
being searched for, that is. Parameterized plans have a hard time here,
but that's not really the fault of the statistics.

> I see where the confusion is coming from. When I said
> What if ( somehow ) we knew that there was a 90%
> chance that query would return an empty result set
> I meant that the planner doesn't know that information. And how could it?

Hmm. IIRC the estimates are set up on the assumption that you are
searching for a value that occurs in the table. I suppose there are
applications where that's often false, but as you say, it's hard to know
that in advance.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Stark 2008-06-10 23:19:36 Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
Previous Message Nathan Boley 2008-06-10 21:31:01 Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics