Re: Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows
Date: 2015-08-06 20:10:52
Message-ID: CANP8+jKp6k5YQGeG=CrhxtU0SCVyNjYC-kW4CqAAYGDoLeXHgQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6 August 2015 at 20:14, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> This table has around 185 rows per page. As the sample size goes up,
> the number times I get zero rows goes down, but those results seem to
> still include data pages with zero rows. For example, here's a series
> of results from a 0.04 sample against the million-row table.
>
> 370
> 370
> 370
> 555
> 555
> 185
> 0
> 925
>
> Since this is a synthetic table I just generated, it contains almost
> exactly 185 rows per data page for every data page. So on a 0.04%
> sample, the variation between 370 rows and 555 rows (whether we have 2
> or 3 data pages) is expected, since 0.04% of 5406 data pages is 2.16 pages.
>
> The results of 0, 185 and 925 are not. It really seems like SYSTEM is
> treating 0.04% as a maximum, but taking a random number of data pages
> somewhere around that maximum, using math which can choose numbers of
> pages far outside of the % requested by the user, and which includes 0.
>
> Speaking from a user perspective, SYSTEM seems broken to me. I can't
> imagine using it for anything with a that degree of variation in the
> number of results returned, especially if it's possible to return zero
> rows from a populated table.
>

The docs say...

The SYSTEM method does block-level sampling with each block having the
specified chance of being selected; all rows in each selected block are
returned.

SQLStandard suggests the parameter value should be an integer, which would
limit it to 1% sample. Allowing a non-integer sample size is a PostgreSQL
extension. There is no guidance on how to implement SYSTEM in the standard.

Given, user-stated probability of accessing a block of P and N total
blocks, there are a few ways to implement block sampling.

1. Test P for each block individually. This gives a range of possible
results, with 0 blocks being possible outcome, though decreasing in
probability as P increases for fixed N. This is the same way BERNOULLI
works, we just do it for blocks rather than rows.

2. We calculate P/N at start of scan and deliver this number blocks by
random selection from N available blocks.

At present we do (1), exactly as documented. (2) is slightly harder since
we'd need to track which blocks have been selected already so we can use a
random selection with no replacement algorithm. On a table with uneven
distribution of rows this would still return a variable sample size, so it
didn't seem worth changing.

This is exactly why TABLESAMPLE was designed to allow you to write your own
sampling algorithm, so you can do it anyway you want.

The context here is that SELECT count(*) from the test table takes 10ms.
The intended use case for this feature is against tables that would give
some kind of performance problem, which isn't the case here. On a table
100x larger, the sample is 1000 +/- 40, which seems more acceptable. The
variability of Bernoulli is much less, but then it is roughly 100x slower.
On a 1TB table, SYSTEM is going to be your only choice.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2015-08-06 20:14:51 Re: Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows
Previous Message Bruce Momjian 2015-08-06 20:08:16 Re: 9.5 release notes