Quick Links

Re: tablesample performance

From:	Francisco Olarte <folarte(at)peoplecall(dot)com>
To:	Andy Colson <andy(at)squeakycode(dot)net>
Cc:	pgsql <pgsql-general(at)postgresql(dot)org>
Subject:	Re: tablesample performance
Date:	2016-10-18 16:44:14
Message-ID:	CA+bJJbxOnJ2DJvVYT7aq8cgnGwem-zJPDjAMDZ1HHbPu6_WrhQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Tue, Oct 18, 2016 at 5:06 PM, Andy Colson <andy(at)squeakycode(dot)net> wrote:
> I wanted to report an awesome performance boost using tablesample.
> In my stored function I was getting a random row using:
> select one into x from ones order by random() limit 1;
> When the table was smaller it worked fine, but the performance has slowly
> gotten worse. This morning I was getting around 8 transactions a second.

Which is not a surprise, as it has to at least read all the rows and
generate a random() for each one and keep track of the minimum.

> I just replaced it with:
> select one into x from ones tablesample bernoulli(1) limit 1;

This should be faster, but to me it seems it does a different thing.
This seems to select each row of the table with probability 1% and
return the first selected, i.e., something similar to

select one into x from ones where random()>0.01 limit 1.

Which has the ( diminishing with table size ) risk of selecting zero
rows and is going to select one of the first 100 or so rows with high
probability, unless I'm missing something.

I say this because docs state ir returns a 'randomly chosen', sample,
not a 'randomly ORDERED' one, and the straightforward implementation
of sampling returns rows in the primitive scan order. I supose it
could be easily tested by selecting bernouilli(100), but have not
server access now to verify it.

With a big table it seems:

select one into x from ones where random()>0.01 order by random() limit 1
or
select one into x from ones tablesample bernoulli(1) order by random() limit 1;

Is more similar to what you originally did ( and the run time should
possibly be something in between ).

I would recomend you to execute the function and verify it does what
you want ( as you say it's fast, I would try selecting a several
thousands and eyeballing the result, if it does what I fear the
grouping should be obvious ).

Maybe you do not mind it, in which case it's ok, but a one minute run
should let you know wahat you are exactly doing.

Francisco Olarte.

In response to

tablesample performance at 2016-10-18 15:06:51 from Andy Colson

Responses

Re: tablesample performance at 2016-10-18 17:17:01 from Andy Colson

Browse pgsql-general by date

	From	Date	Subject
Next Message	Edilmar LISTAS	2016-10-18 16:49:37	Re: Problem changing default data_directory in PG 9.6 + CentOS6
Previous Message	Jeff Janes	2016-10-18 16:09:35	Re: Dump all the indexes/constraints/roles