Re: benchmarking the query planner

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: benchmarking the query planner
Date: 2008-12-12 09:35:49
Message-ID: 1229074549.13078.213.camel@hp_dx2400_1
Lists: pgsql-hackers


On Thu, 2008-12-11 at 18:52 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
> >>> And I would like it even more if the sample size increased according
> >>> to table size, since that makes ndistinct values fairly random for
> >>> large tables.
> >>
> >> Unfortunately _any_ ndistinct estimate based on a sample of the table
> >> is going to be pretty random.
>
> > We know that constructed data distributions can destroy the
> > effectiveness of the ndistinct estimate and make sample size irrelevant.
> > But typical real world data distributions do improve their estimations
> > with increased sample size and so it is worthwhile.
>
> This is handwaving unsupported by evidence.

Not at all.

In the paper cited within the ANALYZE code, available here:
ftp://ftp.research.microsoft.com/users/autoadmin/histogram_conf.pdf
we implement its recommended sample size for reliable histogram
construction, but ignore most of the rest of the paper.
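
For context, the part of the paper we do use is its sample-size bound
for histogram accuracy. If memory serves, the analyze.c comment quotes
Corollary 1 to Theorem 5: for table size n, histogram size k, maximum
relative bin-size error f and error probability gamma, the minimum
random sample is r = 4*k*ln(2*n/gamma)/f^2, and with the constants
assumed there (n = 10^6, gamma = 0.01, f = 0.5) that works out to
roughly 305*k, hence the 300 * statistics_target rows we sample today.
A quick back-of-envelope check (numbers assumed as above):

# Back-of-envelope check of the sample-size bound ANALYZE takes from
# the paper (as quoted in analyze.c):  r = 4*k*ln(2*n/gamma) / f^2
# with n = table size, k = histogram size, f = max relative error in a
# bin, gamma = error probability.  All numbers here are assumed defaults.
from math import ceil, log

def min_sample_rows(k, n=1_000_000, gamma=0.01, f=0.5):
    return ceil(4 * k * log(2 * n / gamma) / f ** 2)

for k in (10, 100, 1000):                     # statistics targets
    print(f"target {k:>4}: paper says {min_sample_rows(k):>7} rows,"
          f" ANALYZE samples {300 * k}")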

Section 4, on block-level sampling, is ignored, even though the
conclusions (section 7.2) are that it provides a larger and more
effective sample without significantly increasing the number of disk
blocks accessed.

Haas and Stokes [1998] also indicate that the accuracy of n-distinct
estimation is linked to sample size.
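
To make that concrete, here is a rough simulation (the column
distribution and all numbers are invented, purely for illustration) of
the Haas/Stokes estimator ANALYZE uses, n*d / (n - f1 + f1*n/N), run
against progressively larger random samples of the same skewed column:

# Rough simulation (all numbers invented) of the Haas/Stokes estimator
# ANALYZE uses:  D_est = n*d / (n - f1 + f1*n/N)
# where n = sample size, N = table rows, d = distinct values seen in
# the sample and f1 = values seen exactly once.
import random
from collections import Counter

def ndistinct_estimate(sample, total_rows):
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    denom = n - f1 + f1 * n / total_rows
    return n * d / denom if denom > 0 else d

random.seed(0)
N = 1_000_000
# Skewed column: half the rows use 100 common values, the other half
# are spread thinly across a long tail of ~200,000 values.
table = [random.randrange(100) if random.random() < 0.5
         else random.randrange(100, 200_100) for _ in range(N)]
true_d = len(set(table))

for n in (3_000, 30_000, 300_000):
    est = ndistinct_estimate(random.sample(table, n), N)
    print(f"sample {n:>7}: estimate {est:>9.0f}   (true D = {true_d})")

The estimate should climb steadily towards the true value as the
sample grows; the point is simply that on skewed data the sample size,
not just the sampling method, drives the error.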

In a previous post to hackers I looked at the case where values were
physically clustered together in the table, either naturally or via the
CLUSTER command. Section: ESTIMATES OF D FOR DEPENDENT TABLES
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00153.php

In that case the current ANALYZE algorithm fails badly because of the
fixed sample size: "ANALYZE randomly samples rows, so that the average
gap between randomly selected rows increases as the table size
increases, because of the fixed sample size. Since the clustered rows
are typically close together, the apparent number of multiple
instances of the same data value decreases as the sample fraction
decreases. Since the sample size is currently fixed, this means that
the D estimate decreases as the table size increases. (This is proven
in a test case below)."

> If you've got a specific
> proposal what to change the sample size to and some numbers about what
> it might gain us or cost us, I'm all ears.

So my specific proposal is: implement block-level sampling (a rough
sketch of the I/O arithmetic is included below).

It allows us to
* increase the sample size without increasing the number of I/Os
* account correctly for clustered data
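
On the first point, a back-of-envelope sketch (the rows-per-block and
table size are assumptions, not measurements): once the table is much
larger than the sample, each individually sampled row typically sits
on a different block, so row-level I/O is roughly one block read per
sampled row, whereas reading whole sampled blocks gets the same row
count from a small fraction of the blocks, or a far larger sample for
the same I/O.

# Back-of-envelope sketch of the I/O argument (all numbers assumed):
# how many distinct blocks a 30,000-row sample touches when rows are
# picked individually versus when whole blocks are read.
import random

random.seed(0)
ROWS_PER_BLOCK = 100                    # assumed tuples per 8KB page
TABLE_ROWS = 100_000_000
SAMPLE_ROWS = 30_000

# Row-level sampling: on a large table nearly every sampled row lands
# on a different block, so I/O is roughly one block per sampled row.
row_sample = random.sample(range(TABLE_ROWS), SAMPLE_ROWS)
blocks_for_rows = len({r // ROWS_PER_BLOCK for r in row_sample})

# Block-level sampling: read whole blocks until we have enough rows.
blocks_for_blocks = SAMPLE_ROWS // ROWS_PER_BLOCK

print(f"row-level  : {SAMPLE_ROWS} rows cost ~{blocks_for_rows} block reads")
print(f"block-level: {SAMPLE_ROWS} rows cost  {blocks_for_blocks} block reads")
# Spending the same ~30,000 block reads on block-level sampling would
# instead yield ~3,000,000 rows -- a far larger sample for the same I/O.
print(f"same I/O   : {SAMPLE_ROWS} blocks -> "
      f"{SAMPLE_ROWS * ROWS_PER_BLOCK:,} sampled rows")

The catch, of course, is that rows within a block are not independent,
which is why the estimation method has to account for block-level
sampling rather than pretend the rows were drawn individually at
random.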

I'm not trying to force this to happen now, I'm just bringing it into
the discussion because it's relevant and has not been mentioned.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
