Quick Links

Re: O(samplesize) tuple sampling, proof of concept

From:	Manfred Koizar <mkoi-pg(at)aon(dot)at>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-patches(at)postgresql(dot)org
Subject:	Re: O(samplesize) tuple sampling, proof of concept
Date:	2004-04-05 22:23:19
Message-ID:	s4j370l6ra56tvodcbg9baaf682q6cu3pn@email.aon.at
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-patches

On Mon, 05 Apr 2004 15:37:07 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>I wouldn't bother with a GUC variable for the production patch.

Among other things the GUC variable will be thrown out for the final
version.

>> Once a block is selected for inspection, all tuples of this
>> block are accessed to get a good estimation of the live : dead row
>> ratio.
>
>Why are you bothering to compute the live:dead ratio?

That was basically bad wording. It should have been "... to get a good
estimation of live rows per page." Counting dead rows turned out to be
trivial, so I just did it and included the number in the debug messages.
Then it happened to be useful for method 2.

>> Because I was afraid that method 1 might be too expensive in terms of
>> CPU cycles, I implemented a small variation that skips tuples without
>> checking them for visibility; this is sampling_method 2.
>
>And? Does it matter?

There's a clearly visible difference for mid-size relations. I'm not
sure whether this can be attributed to visibility bit updating or other
noise-contributing factors.

Method 2 gives a row count estimation error between 10 and 17% where
method 1 is less than 1% off. (My updates generated dead tuples at very
evenly distributed locations by using conditions like WHERE mod(twenty,
7) = 0).

>If that's as bad as it gets I think we are OK. You should redo the test
>with larger sample size though (try stats target = 100) to see if the
>answer changes.

Will do.

>I find -u diffs close to unreadable for reviewing purposes. Please
>submit diffs in -c format in future.

De gustibus non est disputandum :-)
Fortunately this patch wouldn't look much different. There is just a
bunch of "+" lines.

>AFAICS the rows will *always* be sorted already, and so this qsort is an
>utter waste of cycles. No?

I don't think so. We get the *blocks* in the correct order. But tuples
are still sampled by the Vitter method which starts to replace random
tuples after the pool is filled.

BTW, thanks for the paper!

Servus
Manfred

In response to

Re: O(samplesize) tuple sampling, proof of concept at 2004-04-05 19:37:07 from Tom Lane

Responses

Re: O(samplesize) tuple sampling, proof of concept at 2004-04-05 22:30:29 from Tom Lane

Browse pgsql-patches by date

	From	Date	Subject
Next Message	Tom Lane	2004-04-05 22:30:29	Re: O(samplesize) tuple sampling, proof of concept
Previous Message	Tom Lane	2004-04-05 19:37:07	Re: O(samplesize) tuple sampling, proof of concept