From: | Florian Pflug <fgp(at)phlo(dot)org> |
---|---|
To: | Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> |
Cc: | <josh(at)agliodbs(dot)com>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <robertmhaas(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <huangqiyx(at)hotmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net> |
Subject: | Re: Gsoc2012 idea, tablesample |
Date: | 2012-05-11 13:54:35 |
Message-ID: | 87B3326E-03B2-4724-8FF8-2C07AA0FC854@phlo.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
[ Sorry for the self-reply ]
On May11, 2012, at 13:17 , Florian Pflug wrote:
> Actually, thinking more about this, if the approach sketched above
> turns out to work, then one might be able to get away without remembering
> previously computed TIDs, thus removing a lot of complexity.
>
> Say, for example, you simply start out with a single random TID tid[0].
> The you produce the sequence of "random" TIDs by setting
>
> tid[n+1] = tid[n] + random(1 <= x <= 2*D-1)
>
> where D is the expected distance between one TID from the sample set
> and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop once
> tid[n] >) #TIDs). This will give you approximately uniformly distributed
> TIDs, I think, and will even return them in physical order. The 100$ question
> is, by how much does this violate the independence requirement, i.e. how
> far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart?
> Some search through the literature should be able to answer that.
Actually, one can easily do better then that. Say you've determined that
to pick each tuple with probability p_tup, you need to pick each TID with
probability p_tid (where p_tup/p_tid is the TID density, i.e. the probability
of a single TID being live). Now, looping over all TIDs and picking each with
probability p_tid is, of course, just a another (more error-prone) way of
iterating over all tuples and picking each with probability p_tup. But instead
of looping, you can jump from from picked TID to the next. Observe that, after
having picked tid X, the probability of picking X+i as the next tid is
(1 - p_tid)^(i-1) * p_tid, because to pick X+i next you have to skip (i-1) TIDs
and then pick the i-th one. Thus, the following algorithm returns a sorted list
of TIDs which should be uniformly distributed and independent (or so I think,
at least)
1) Starting with a random tid tid[0], chosen such that
P(tid[0] = i) = (1 - p_tid)^i * p_tid
2) Setting tid[n+1] = tid[n] + d, which d > 0 chosen such that
P(d = i) = (1 - p_tid)^(i-1) * p_tid
3) Abort if max(tid) is >= #TIDs.
This all hinges on the ability to produce a sufficient accurate estimate of the
TID density p_tup/p_tid, of course.
best regards,
Florian Pflug
From | Date | Subject | |
---|---|---|---|
Next Message | Sandro Santilli | 2012-05-11 13:59:51 | Re: Gsoc2012 idea, tablesample |
Previous Message | Andrew Dunstan | 2012-05-11 13:51:49 | Re: Draft release notes complete |