Florian Pflug <fgp(at)phlo(dot)org> wrote:
> Maybe one can get rid of these sorts of problems by factoring in
> the expected density of the table beforehand and simply accepting
> that the results will be inaccurate if the statistics are
> outdated?
>
> One could, for example, simply pick
>
> N := SamplingPercentage * MaxTuplesPerPage / AvgLiveTuplesPerPage
>
> where
>
> AvgLiveTuplesPerPage := #Tuples / #Pages
>
> random TIDs, fetch the live ones, and return them.
To clarify, I read this as using reltuples and relpages for the
table, and returning only tuples which are visible according to the
query's snapshot. (i.e., I think you used "live" to mean two
different things there.)
Unless I'm missing something, I think that works for percentage
selection, which is what the standard talks about, without any need
to iterate through additional samples. Good idea! We don't need to
do any second pass to pare down initial results, either. This
greatly simplifies coding while providing exactly what the standard
requires.
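
To make the arithmetic concrete, here's a quick stand-alone sketch
of that calculation. The names (tids_to_draw, MAX_TUPLES_PER_PAGE)
are illustrative stand-ins, not backend code; reltuples and relpages
would come from pg_class. Note that the draw count algebraically
reduces to the sampling fraction applied to the whole TID space,
relpages * MaxTuplesPerPage:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_TUPLES_PER_PAGE 291  /* stand-in for MaxHeapTuplesPerPage */

/*
 * Number of random TIDs to draw so that, if the statistics are
 * accurate, the expected number of live tuples fetched is pct% of
 * reltuples.  A random TID (page, slot) hits a live tuple with
 * probability AvgLiveTuplesPerPage / MaxTuplesPerPage, so the draw
 * count is inflated by the reciprocal of that ratio.
 */
static uint64_t
tids_to_draw(double reltuples, double relpages, double pct)
{
    double avg_live_per_page = reltuples / relpages;
    double n = (pct / 100.0) * reltuples *
        MAX_TUPLES_PER_PAGE / avg_live_per_page;

    return (uint64_t) ceil(n);
}

int
main(void)
{
    /* 1M-tuple table over 10000 pages, sampling 1% */
    uint64_t n = tids_to_draw(1000000.0, 10000.0, 1.0);

    printf("draw %llu random TIDs\n", (unsigned long long) n);
    return 0;
}

With those numbers it draws 29100 TIDs; each hits a live tuple with
probability 100/291, so the expected yield is the 10000 tuples (1%)
we're after.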
> I'm not totally sure whether this approach is sensible to
> non-uniformity in the tuple to line-pointer assignment, though.
I think we can solve that by drawing tuple numbers high enough to
reach the highest tuple ID that might be in use in the table, and
*not* following HOT chains. (If we followed HOT chains, we could
have several distinct ctid values which return the same tuple.) Or
am I thinking about this incorrectly?
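
To illustrate what I mean, here's a rough stand-alone sketch of the
TID generation (again, illustrative names only; the real fetch would
test visibility at exactly the generated TID, never chasing the ctid
chain from a HOT root):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_TUPLES_PER_PAGE 291  /* stand-in for MaxHeapTuplesPerPage */

typedef struct
{
    uint32_t block;   /* page number, 0 .. relpages-1 */
    uint16_t offset;  /* line pointer, 1 .. MAX_TUPLES_PER_PAGE */
} SampleTid;

/*
 * Draw one TID uniformly over the whole conceivable TID space.
 * Covering every slot up to MAX_TUPLES_PER_PAGE means non-uniform
 * tuple-to-line-pointer assignment can't bias which tuples are
 * reachable; draws landing on unused or dead slots return nothing,
 * and the inflated draw count already compensates for those misses.
 */
static SampleTid
random_tid(uint32_t relpages)
{
    SampleTid t;

    t.block = (uint32_t) (rand() % relpages);
    t.offset = (uint16_t) (1 + rand() % MAX_TUPLES_PER_PAGE);
    return t;
}

int
main(void)
{
    srand((unsigned) time(NULL));
    for (int i = 0; i < 5; i++)
    {
        SampleTid t = random_tid(10000);

        /*
         * The fetch step would check visibility at exactly
         * (block, offset), so each visible tuple is reachable
         * through exactly one TID.
         */
        printf("(%u,%u)\n", t.block, t.offset);
    }
    return 0;
}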
> [more complex alternatives]
I really think your first suggestion covers it perfectly; these more
complex techniques don't seem necessary to me.
-Kevin