Feature request for count_estimate(samplesize) aggregate or SAMPLE keyword

From: Torge <torgato(at)posteo(dot)de>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Feature request for count_estimate(samplesize) aggregate or SAMPLE keyword
Date: 2022-09-01 01:07:09
Message-ID: 0e8b1eb9-97bf-0ddf-a487-ac4e4f6e0186@posteo.de
Lists: pgsql-general

Dear Postgres Devs,

I use Postgres everywhere I can and for me it is by far the best
database system out there. Great job, thank you!

Now I would like to humbly propose a feature that provides an easy way to
get a quick count estimate for any condition - index-supported or not -
based on a random sample of rows, without requiring a custom function or
a complex SQL statement.

The final syntax could be for example:

    SELECT count_estimate(2000) FROM my_table WHERE ...

or, since peerce in the chat pointed out that aggregate functions are
only invoked after the WHERE condition has already been evaluated:

    SAMPLE count(*) FROM my_table WHERE ... SAMPLESIZE 2000

The idea of this function is to estimate the total number of rows
matching the WHERE condition by checking a random sample of rows against it:

1. Get an as-good-as-possible but quick estimate of the maximum number of
rows (rowcount).
2. Select a random sample of all rows (samplesize).
3. Check how many of them match the WHERE condition (hits).
4. Calculate and return the estimate as rowcount * hits / samplesize.
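
For comparison, something close to steps 2-4 can already be approximated
with the existing TABLESAMPLE clause (available since PostgreSQL 9.5),
though it samples a percentage of the table rather than a fixed number of
rows and has no adaptive behaviour. A minimal sketch, assuming a table
my_table:

    --sample roughly 1% of the rows and scale the matching count back up by 100
    SELECT count(*) * 100 AS estimated_count
    FROM my_table TABLESAMPLE BERNOULLI (1)
    WHERE ...; --any condition, index-supported or not

(SYSTEM instead of BERNOULLI samples whole pages and is usually faster,
at the cost of a clumpier sample.)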

The higher the sample size, the more precise the count, so the developer
can decide on the tradeoff between accuracy and speed.
Of course this has its limitations if the sample size is so small, or the
actual results so few, that only a few hits - if any at all - make it
into the sample.
So maybe a min_hits value should be considered as well, so the sample
size is increased until at least that many hits are found or an exact
count has been produced (or until an optional max_samplesize is reached).
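
A minimal sketch of how such a min_hits loop could look today, written as
a hypothetical PL/pgSQL function on top of the ID-probing approach from
the PS below (my_table, my_id and the hard-coded condition are just
stand-ins):

CREATE OR REPLACE FUNCTION count_estimate_demo(p_min_hits int,
                                               p_start_sample int,
                                               p_max_sample int)
RETURNS bigint AS $$
DECLARE
    v_max_id bigint;
    v_sample int := p_start_sample;
    v_hits   bigint;
BEGIN
    --step 1: quick upper bound, here simply the highest existing ID
    SELECT my_id INTO v_max_id FROM my_table ORDER BY my_id DESC LIMIT 1;
    LOOP
        --steps 2 and 3: probe v_sample random IDs and count the matches
        SELECT count(*) INTO v_hits
        FROM my_table
        WHERE my_id = ANY (ARRAY(
                  SELECT (random() * v_max_id)::bigint
                  FROM generate_series(1, v_sample)))
          AND my_column = 42; --hypothetical stand-in for the real WHERE condition
        EXIT WHEN v_hits >= p_min_hits OR v_sample >= p_max_sample;
        --not enough hits yet, retry with a doubled sample
        v_sample := least(v_sample * 2, p_max_sample);
    END LOOP;
    --step 4: scale the hits back up to the full ID space
    RETURN v_max_id * v_hits / v_sample;
END;
$$ LANGUAGE plpgsql;

For simplicity the sketch regenerates the whole sample on each iteration
instead of extending it, and it never falls back to an exact count.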

Typical scenario: highly flexible searching/filtering of data with a
paging mechanism and a "total results" preview.

I think this would find a lot of fans. When I once more hit the famous
count(*) performance issue, I again read through lots of comments,
Stack Overflow answers and mail threads discussing it and its possible
workarounds and solutions, one more complex than the other.
I ended up writing my own SQL statement that achieves the above, and I
think it meets the requirements of many who started those discussions,
but I would really love to see this become a supported feature so it is
easier to use and most likely far more performant.

Greetings,
     Torge.

PS:

Here is my current SQL-level statement to achieve this. Due to limitations
I couldn't work around, it works slightly differently: instead of the row
count, it uses the maximum ID of the table, which of course has to exist
for this to work at all.
Using reltuples was suggested, but on my system it is already often off
by more than 10%, and in that case I don't know how to select random rows
efficiently, while I can easily generate random IDs in the space of
possible IDs.

My statement returns an estimate in about 1 second on a 100M-row table,
with a WHERE condition that is not supported by an index and matches 10M
entries; the estimate is roughly 5% off, while an exact count takes over
20 seconds.

WITH vars AS (
    SELECT (SELECT my_id FROM my_table ORDER BY my_id DESC LIMIT 1) AS max_id, --the maximum ID currently used in this table
           2000 AS sample_size --the number of entries to sample; the higher, the more precise the estimate
),
random_rows AS ( --create a temporary result set with sample_size random rows
    SELECT *
    FROM my_table
    WHERE my_id = ANY (ARRAY(
        SELECT (random() * (SELECT max_id FROM vars))::int --select a random id from 0 to max_id
        FROM GENERATE_SERIES(1, (SELECT sample_size FROM vars)) --generate sample_size rows
    ))
)
SELECT (SELECT max_id FROM vars) * count(*) / (SELECT sample_size FROM vars)::int AS estimated_count
FROM random_rows
WHERE ... --any WHERE condition possible

There was a concern mentioned in chat that the ID space might have holes,
especially since the sequence is not reset on failed transactions or when
rows are deleted. But this should not matter much: it reduces the
accuracy, which can be countered by a bigger sample size. A min_id could
also easily be added for cases where old rows get deleted.
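
A minimal sketch of what that could look like in the vars CTE of the
statement above, assuming min_id is simply taken as the lowest existing ID:

WITH vars AS (
    SELECT (SELECT my_id FROM my_table ORDER BY my_id ASC LIMIT 1) AS min_id,
           (SELECT my_id FROM my_table ORDER BY my_id DESC LIMIT 1) AS max_id,
           2000 AS sample_size
)
SELECT (SELECT min_id FROM vars)
       + (random() * ((SELECT max_id FROM vars) - (SELECT min_id FROM vars)))::int --random id between min_id and max_id
FROM GENERATE_SERIES(1, (SELECT sample_size FROM vars));

The final scaling would then use (max_id - min_id) instead of max_id,
since that is the size of the probed ID space.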

Example:

The table has 100M entries.
The condition would match 10M entries.
The highest ID is 300,000,000, so only a third of the IDs really exist in
the table.
Sample size: 3000

We try to fetch 3000 rows by probing random IDs, which yields roughly
1000 actual rows, let's say 1012.
We compare those rows against our WHERE condition and 96 of them match.

The resulting estimate is maxId * hits / samplesize
= 300,000,000 * 96 / 3000 = 9,600,000

The holes, no matter where they are, reduce the precision but are not a
general problem.
This becomes clearer if we first estimate the number of actually existing
rows from the sample and then continue from there, which in the end
returns the same result:

actualRowEstimate = maxId * fetchedRows / samplesize
= 300,000,000 * 1012 / 3000 = 101,200,000
estimate = actualRowEstimate * hits / fetchedRows
= 101,200,000 * 96 / 1012 = 9,600,000  => same number

Because:
estimate = actualRowEstimate * hits / fetchedRows
= (maxId * fetchedRows / samplesize) * hits / fetchedRows
= maxId * hits / samplesize
