Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows
Date: 2015-08-06 19:14:42
Message-ID: 55C3B222.1060301@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Version: 9.5alpha2

Issue: when requesting small samples, SYSTEM often returns zero rows,
and sometimes returns unexpected numbers of rows.

Example:

create table thous ( id int, val text );
insert into thous select i, i::text || '-val' from
generate_series(1,100000) as gs(i);
analyze;

This is a 100,000 row table, so a 0.01% sample should be 10 rows. Since
10 rows is far less than what's on one data page, the documentation
suggests that I should get 1 data page worth of rows. However:

postgres=# select * from thous tablesample system ( 0.01 );
id | val
----+-----
(0 rows)

Time: 0.636 ms

... this query consistently returns 0 rows, in 20 runs.

Ok,let's try a million-row table, which is the example we have in the docs.

postgres=# select * from mil tablesample system ( 0.01 );
id | val
----+-----
(0 rows)

Hmmm?

On testing, the query against the million-row table returns 0 rows
around 50% of the time.

This table has around 185 rows per page. As the sample size goes up,
the number times I get zero rows goes down, but those results seem to
still include data pages with zero rows. For example, here's a series
of results from a 0.04 sample against the million-row table.

370
370
370
555
555
185
0
925

Since this is a synthetic table I just generated, it contains almost
exactly 185 rows per data page for every data page. So on a 0.04%
sample, the variation between 370 rows and 555 rows (whether we have 2
or 3 data pages) is expected, since 0.04% of 5406 data pages is 2.16 pages.

The results of 0, 185 and 925 are not. It really seems like SYSTEM is
treating 0.04% as a maximum, but taking a random number of data pages
somewhere around that maximum, using math which can choose numbers of
pages far outside of the % requested by the user, and which includes 0.

Speaking from a user perspective, SYSTEM seems broken to me. I can't
imagine using it for anything with a that degree of variation in the
number of results returned, especially if it's possible to return zero
rows from a populated table.

BERNOULLI works as expected.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2015-08-06 19:27:02 Re: Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows
Previous Message Robert Haas 2015-08-06 18:46:17 Re: Race conditions in shm_mq.c