From: | "Greg Stark" <stark(at)enterprisedb(dot)com> |
---|---|
To: | "Simon Riggs" <simon(at)2ndquadrant(dot)com> |
Cc: | "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: benchmarking the query planner |
Date: | 2008-12-12 18:01:40 |
Message-ID: | 4136ffa0812121001x3fde1871t558fd8c58070b7d@mail.gmail.com |
Lists: pgsql-hackers
On Fri, Dec 12, 2008 at 5:33 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Incidentally we *do* do block sampling. We pick random blocks and then pick
>> random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime
>> around then. It dramatically reduced the i/o requirements but there were long
>> discussions of how to do it without introducing biases.
>
> No, we pick random rows. On bigger tables, they get further apart
> typically and so we miss any clustering. I mean that we should pick a
> random block and read all rows on it.
I think what's happening here is that our sampling method is based on
the assumption that a record's location is not related to its value.
Consider a table which is clustered and has two copies of each value.
When we look at a block of n rows and see only n/2 distinct values,
and we know there are 1000 blocks, a human would conjecture that there
are 1000*n/2 distinct values throughout the table and that every value
is represented exactly twice. But if we assume the records are
randomly distributed, looking at the whole block will throw us off
completely. From the fact that we saw every value twice in such a
small sample we'll deduce that there must be hundreds of copies spread
throughout the database, and therefore far fewer than 1000*n/2
distinct values.
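
To make that bias concrete, here is a minimal simulation sketch (Python,
not backend code). It builds a perfectly clustered table where every value
appears exactly twice, then feeds a whole-block sample and an equal-sized
independent row sample to a Haas-Stokes-style duplicates-based estimator
(Duj1), which stands in here for any estimator that assumes rows are placed
independently of their values. The table dimensions and the choice of
estimator are illustrative assumptions, not what ANALYZE literally does.

    import random
    from collections import Counter

    BLOCKS, ROWS_PER_BLOCK = 1000, 100
    N = BLOCKS * ROWS_PER_BLOCK              # total rows in the table
    table = [i // 2 for i in range(N)]       # each value stored twice, adjacently

    def duj1(sample, total_rows):
        # Haas-Stokes Duj1 estimator: D = n*d / (n - f1 + f1*n/N)
        n, counts = len(sample), Counter(sample)
        d = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        return n * d / (n - f1 + f1 * n / total_rows)

    # Whole-block sample: 30 random blocks, every row in them.
    blocks = random.sample(range(BLOCKS), 30)
    block_sample = [v for b in blocks
                    for v in table[b * ROWS_PER_BLOCK:(b + 1) * ROWS_PER_BLOCK]]

    # Row sample of the same size, rows drawn independently across the table.
    row_sample = random.sample(table, len(block_sample))

    print("true distinct:        ", N // 2)
    print("block-sample estimate:", round(duj1(block_sample, N)))
    print("row-sample estimate:  ", round(duj1(row_sample, N)))

On a run of this the row sample comes out near the true 50,000 distinct
values, while the whole-block sample sees every value duplicated (f1 = 0)
and the estimator collapses to roughly the 1,500 distinct values it saw in
the sample, which is the failure described above.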
I think you need to find two different formulas, one which models a
clustered table and one which models randomly distributed data. Then
you need a way to measure just how clustered the data is, so you know
how much weight to give each formula. Perhaps comparing the number of
duplicates seen in whole-block samples against the number seen in an
equal-sized random row sample would give that measure.
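
A rough sketch of such a measure, continuing the simulation above; the
function names and the 0..1 weighting are purely illustrative, nothing
like this exists in the backend today:

    from collections import Counter

    def duplicate_fraction(sample):
        # Fraction of sampled rows whose value occurs more than once in the sample.
        counts = Counter(sample)
        return sum(c for c in counts.values() if c > 1) / len(sample)

    def clustering_weight(block_sample, row_sample):
        # Crude 0..1 weight: 0 ~ values look randomly placed, 1 ~ fully clustered.
        b = duplicate_fraction(block_sample)
        r = duplicate_fraction(row_sample)
        if b <= r:
            return 0.0
        return min(1.0, (b - r) / b)

    # Hypothetical combining step, reusing block_sample/row_sample from above:
    # w = clustering_weight(block_sample, row_sample)
    # estimate = w * clustered_estimate + (1 - w) * duj1(row_sample, N)

In the clustered example above the block sample has a duplicate fraction
of 1.0 while the row sample's is close to 0, so the weight swings almost
entirely toward the clustered formula; on genuinely random data the two
fractions roughly match and the weight drops to essentially zero.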
--
greg