Quick Links

Re: ANALYZE sampling is too good

From:	Greg Stark <stark(at)mit(dot)edu>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: ANALYZE sampling is too good
Date:	2013-12-09 18:54:21
Message-ID:	CAM-w4HMFAfAAUp5YDEFkRTNt7UbrxT-9ecAJUNgwv2DyC=weGg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
> It's also applicable for the other stats; histogram buckets constructed
> from a 5% sample are more likely to be accurate than those constructed
> from a 0.1% sample. Same with nullfrac. The degree of improved
> accuracy, would, of course, require some math to determine.

This "some math" is straightforward basic statistics. The 95th
percentile confidence interval for a sample consisting of 300 samples
from a population of a 1 million would be 5.66%. A sample consisting
of 1000 samples would have a 95th percentile confidence interval of
+/- 3.1%.

The histogram and nullfact answers the same kind of question as a
political poll, "what fraction of the population falls within this
subset". This is why pollsters don't need to sample 15 million
Americans to have a decent poll result. That's just not how the math
works for these kinds of questions.

n_distinct is an entirely different kettle of fish. It's a different
kind of problem and the error rate there *is* going to be dependent on
the percentage of the total population that you sampled. Moreover from
the papers I read I'm convinced any sample less than 50-80% is nearly
useless so I'm convinced you can't get good results without reading
the whole table.

--
greg

In response to

Re: ANALYZE sampling is too good at 2013-12-09 18:03:50 from Josh Berkus

Responses

Re: ANALYZE sampling is too good at 2013-12-09 18:56:48 from Greg Stark
Re: ANALYZE sampling is too good at 2013-12-10 08:28:23 from Albe Laurenz

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2013-12-09 18:56:45	Re: About shared cache invalidation mechanism
Previous Message	Jim Nasby	2013-12-09 18:51:01	Re: plpgsql_check_function - rebase for 9.3