Re: Gsoc2012 idea, tablesample

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	josh(at)agliodbs(dot)com, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, ants(at)cybertec(dot)at, heikki(dot)linnakangas(at)enterprisedb(dot)com, cbbrowne(at)gmail(dot)com, neil(dot)conway(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, huangqiyx(at)hotmail(dot)com, "Florian Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net
Subject:	Re: Gsoc2012 idea, tablesample
Date:	2012-05-11 15:35:28
Message-ID:	5346.1336750528@sss.pgh.pa.us
Lists:	pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> [ uniformly sample the TID space defined as (1..P, 1..M) ]

> Shouldn't that get us the randomly chosen sample we're looking for?
> Is there a problem you think this ignores?

Not sure. The issue that I'm wondering about is that the line number
part of the space is not uniformly populated, ie, small line numbers
are much more likely to exist than large ones. (In the limit, the density
of populated line numbers goes to zero when you pick M much too large.) It's not clear
to me whether this gives an unbiased probability of picking real tuples,
as opposed to hypothetical TIDs.
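
One way to probe the bias question empirically is a small simulation. The sketch below is a toy model, not PostgreSQL code: it assumes the live tuples on each page occupy line numbers 1..k with no holes, draws TIDs uniformly with replacement from the (page, line) space, and compares how often tuples on sparse pages are hit versus tuples on dense pages. The function names and the page layout are made up for illustration.

```python
import random


def sample_tid_space(tuples_per_page, max_line, draws, seed=0):
    """Draw TIDs uniformly from (page, 1..max_line) and record hits.

    tuples_per_page[p] is the number of live tuples on page p; the toy
    model assumes they sit at line numbers 1..tuples_per_page[p].
    Returns (hits, misses): hits maps (page, line) -> hit count, misses
    counts draws that landed on a hypothetical (nonexistent) TID.
    """
    rng = random.Random(seed)
    num_pages = len(tuples_per_page)
    hits = {}
    misses = 0
    for _ in range(draws):
        page = rng.randrange(num_pages)        # 0-based page index
        line = rng.randrange(1, max_line + 1)  # 1-based line number
        if line <= tuples_per_page[page]:
            hits[(page, line)] = hits.get((page, line), 0) + 1
        else:
            misses += 1
    return hits, misses


def mean_hits_per_tuple(hits, tuples_per_page, page_size):
    """Average hit count per real tuple over pages holding page_size tuples.

    Assumes at least one page has exactly page_size tuples.
    """
    pages = {p for p, k in enumerate(tuples_per_page) if k == page_size}
    total_tuples = page_size * len(pages)
    total_hits = sum(c for (p, _), c in hits.items() if p in pages)
    return total_hits / total_tuples


if __name__ == "__main__":
    # Hypothetical table: 50 sparse pages (5 tuples), 50 dense pages (50 tuples).
    layout = [5] * 50 + [50] * 50
    hits, misses = sample_tid_space(layout, max_line=100, draws=200_000, seed=1)
    print("sparse:", mean_hits_per_tuple(hits, layout, 5))
    print("dense: ", mean_hits_per_tuple(hits, layout, 50))
    print("fraction of draws wasted:", misses / 200_000)
```

In this model every candidate TID is equally likely, so real tuples on sparse and dense pages should come out with about the same hit rate. It does not cover holes left by dead tuples or any particular choice of M, and it also shows how many draws land on TIDs that do not exist.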

Another issue is efficiency. In practical cases you'll have to greatly
overestimate M compared to the typical actual-number-of-tuples-per-page,
which will lead to a number of target TIDs N that's much larger than
necessary, which will make the scan slow --- I think in practice you'll
end up doing a seqscan or something that might as well be one, because
unless S is *really* tiny it'll hit just about every page. We can have
that today without months' worth of development effort, using the "WHERE
random() < S" technique.
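
The efficiency argument can be put into rough numbers. The sketch below assumes that N = S * P * M target TIDs are drawn uniformly and independently (with replacement) over P pages, so the expected fraction of pages containing at least one target is 1 - (1 - 1/P)^N, which is close to 1 - exp(-S * M). The formula for N and the example value M = 291 are assumptions made for illustration; the original proposal is only summarized in the quote above.

```python
def expected_page_fraction(num_pages, max_line, sample_fraction):
    """Expected fraction of pages hit by at least one target TID.

    Assumes N = sample_fraction * num_pages * max_line targets, each
    landing on a uniformly random page, independently of the others.
    """
    n_targets = round(sample_fraction * num_pages * max_line)
    # Probability that a given page is missed by every one of the targets.
    p_missed = (1.0 - 1.0 / num_pages) ** n_targets
    return 1.0 - p_missed


if __name__ == "__main__":
    # Hypothetical table of 10,000 pages with M assumed to be 291.
    for s in (0.0001, 0.001, 0.01, 0.1):
        frac = expected_page_fraction(10_000, 291, s)
        print(f"S = {s}: about {frac:.1%} of pages touched")
```

Under these assumptions a 1% sample already touches roughly 95% of the pages, which is consistent with the point that the scan degenerates into something close to a seqscan unless S is very small.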

regards, tom lane

In response to

Re: Gsoc2012 idea, tablesample at 2012-05-11 15:04:46 from Kevin Grittner

Responses

Re: Gsoc2012 idea, tablesample at 2012-05-11 15:50:37 from Kevin Grittner
