Quick Links

Re: Improving N-Distinct estimation by ANALYZE

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Improving N-Distinct estimation by ANALYZE
Date:	2006-01-05 08:11:01
Message-ID:	1136448661.21025.251.camel@localhost.localdomain
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, 2006-01-05 at 00:33 -0500, Greg Stark wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>
> > The approach I suggested uses the existing technique for selecting
> > random blocks, then either an exhaustive check on all of the rows in a
> > block or the existing random row approach, depending upon available
> > memory. We need to check all of the rows in a reasonable sample of
> > blocks otherwise we might miss clusters of rows in large tables - which
> > is the source of the problems identified.
> >
> > The other reason was to increase the sample size, which is a win in any
> > form of statistics.
>
> Only if your sample is random and independent. The existing mechanism tries
> fairly hard to ensure that every record has an equal chance of being selected.
> If you read the entire block and not appropriate samples then you'll introduce
> systematic sampling errors. For example, if you read an entire block you'll be
> biasing towards smaller records.

Yes, I discussed that, following Brutlag & Richardson [2000]. The bottom
line is if there is no clustering, block sampling is random, which is
good; if there is clustering, then you spot it, which is good.

> I think it would be useful to have a knob to increase the sample size
> separately from the knob for the amount of data retained in the statistics
> tables. Though I think you'll be disappointed and find you have to read an
> unreasonably large sample out of the table before you get more useful distinct
> estimates.

OK, I'll look at doing that.

Best Regards, Simon Riggs

In response to

Re: Improving N-Distinct estimation by ANALYZE at 2006-01-05 05:33:55 from Greg Stark

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Marko Kreen	2006-01-05 10:31:28	Re: Heads up: upcoming back-branch re-releases
Previous Message	Josh Berkus	2006-01-05 06:31:03	Re: Improving N-Distinct estimation by ANALYZE