From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 22:19:48
Message-ID: 52A4F084.3070200@catalyst.net.nz
Lists: pgsql-hackers
On 09/12/13 08:03, Josh Berkus wrote:
>
> So there's a set of math designed to calculate for the skew introduced
> by reading *all* of the rows in each block. That's what I meant by
> "block-based sampling"; you read, say, 400 pages, you compile statistics
> on *all* of the rows on those pages, you apply some algorithms to adjust
> for groupings of rows based on how grouped they are. And you have a
> pretty good estimate of how grouped they are, because you just looked at
> complete sets of rows on a bunch of nonadjacent pages.
>
> Obviously, you need to look at more rows than you would with a
> pure-random sample. Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample. However, since those rows come from the
> same pages, the cost of looking at more rows is quite small, compared to
> the cost of looking at 64 times as many disk pages.
>
> My ACM subscription has lapsed, though; someone with a current ACM
> subscription could search for this; there are several published papers,
> with math and pseudocode.
>
This makes more sense! Unfortunately you had confused everyone (well, me
at least) by decreeing that we needed block-based sampling - when we
have been using that since 2004. There is more than one way to sample
blocks, it seems :-)
What you are actually suggesting is a way to do block-based sampling
that requires reading fewer blocks, and that is a great idea! Of course,
as Greg has just suggested, the details are gonna be the clincher. Can
you supply authors or titles of the papers?
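For what it's worth, the scheme described above can be sketched roughly
as follows. This is just a minimal Python illustration of the idea, not
ANALYZE code; the page layout and the 400-page target are made-up
assumptions, and the hard part (adjusting the statistics for within-page
clustering) is only noted in a comment:

```python
import random

def block_sample(table_pages, n_pages, seed=0):
    """Pick n_pages whole pages at random and keep EVERY row on each
    sampled page, instead of sampling individual rows spread across
    the whole table.

    table_pages: list of pages, each page being a list of row values.
    n_pages: number of pages to read (e.g. 400).
    Returns (sampled_rows, sampled_page_numbers).
    """
    rng = random.Random(seed)
    n_pages = min(n_pages, len(table_pages))
    # Uniform sample of page numbers, without replacement, so the
    # pages are nonadjacent with high probability on a large table.
    chosen = rng.sample(range(len(table_pages)), n_pages)
    rows = [row for p in chosen for row in table_pages[p]]
    # NOTE: because whole pages are kept, rows from the same page are
    # correlated; any estimator built on `rows` (ndistinct, MCVs, etc.)
    # would still need a clustering correction, per the cited papers.
    return rows, chosen
```

The point of the trade-off is visible here: for the same number of
sampled rows, this touches far fewer distinct pages than a row-level
sample would, at the cost of a skewed (clustered) sample that the
follow-up math has to correct for.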
Also, with reference to Greenplum: their use of Postgres is a fairly
special case, and an approach that works well for them is not always
suitable for more general-purpose use.
Regards
Mark