From: | Andy Colson <andy(at)squeakycode(dot)net> |
---|---|
To: | Francisco Olarte <folarte(at)peoplecall(dot)com> |
Cc: | pgsql <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: tablesample performance |
Date: | 2016-10-18 17:17:01 |
Message-ID: | 28f2c186-c5f6-3266-6527-eae6f96919e5@squeakycode.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On 10/18/2016 11:44 AM, Francisco Olarte wrote:
> On Tue, Oct 18, 2016 at 5:06 PM, Andy Colson <andy(at)squeakycode(dot)net> wrote:
>> I wanted to report an awesome performance boost using tablesample.
>> In my stored function I was getting a random row using:
>> select one into x from ones order by random() limit 1;
>> When the table was smaller it worked fine, but the performance has slowly
>> gotten worse. This morning I was getting around 8 transactions a second.
>
> Which is not a surprise, as it has to at least read all the rows and
> generate a random() for each one and keep track of the minimum.
>
>> I just replaced it with:
>> select one into x from ones tablesample bernoulli(1) limit 1;
>
> This should be faster, but to me it seems it does a different thing.
> This seems to select each row of the table with probability 1% and
> return the first selected, i.e., something similar to
>
> select one into x from ones where random()>0.01 limit 1.
>
> Which has the ( diminishing with table size ) risk of selecting zero
> rows and is going to select one of the first 100 or so rows with high
> probability, unless I'm missing something.
>
> I say this because docs state ir returns a 'randomly chosen', sample,
> not a 'randomly ORDERED' one, and the straightforward implementation
> of sampling returns rows in the primitive scan order. I supose it
> could be easily tested by selecting bernouilli(100), but have not
> server access now to verify it.
>
> With a big table it seems:
>
> select one into x from ones where random()>0.01 order by random() limit 1
> or
> select one into x from ones tablesample bernoulli(1) order by random() limit 1;
>
> Is more similar to what you originally did ( and the run time should
> possibly be something in between ).
>
>
> I would recomend you to execute the function and verify it does what
> you want ( as you say it's fast, I would try selecting a several
> thousands and eyeballing the result, if it does what I fear the
> grouping should be obvious ).
>
> Maybe you do not mind it, in which case it's ok, but a one minute run
> should let you know wahat you are exactly doing.
>
> Francisco Olarte.
>
Ah, yes, you're right, there is a bit of a difference there.
Speed wise:
1) select one from ones order by random() limit 1;
> about 360ms
2) select one from ones tablesample bernoulli(1) limit 1 ;
> about 4ms
3) select one from ones tablesample bernoulli(1) order by random() limit 1;
> about 80ms
Using the third option in batch, I'm getting about 15 transactions a second.
Oddly:
select one from ones tablesample bernoulli(0.25) order by random()
takes almost 80ms also.
bernoulli(0.25) returns 3k rows
bernoulli(1) returns 14k rows
Thanks,
-Andy
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2016-10-18 17:34:32 | Re: tablesample performance |
Previous Message | Edilmar LISTAS | 2016-10-18 17:12:28 | Re: Problem changing default data_directory in PG 9.6 + CentOS6 |