From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H |
Date: | 2015-06-19 19:48:36 |
Message-ID: | CAMkU=1zEV+Pwx=QBKTPSGU=qjvqVkpAsHi0JbEKWjQifmV4TYQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com
> wrote:
> But I think you might be on to something, because I manually collected a
> random sample with 30k rows (by explicitly generating 30k random TIDs), and
> I get this:
>
> tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt from
> lineitem_sample group by 1) foo group by 1;
>
> cnt | count
> -----+-------
> 1 | 29998
> 2 | 1
> (2 rows)
>
>
> That's quite different compared to what analyze gets, which effectively
> looks something like this (this is derived from the logs, so not perfectly
> accurate - I only have f1, ndistinct, nmultiple):
>
> cnt | count
> -----+-------
> 1 | 27976
> 2 | 976
> 3 | 24
>
> Am I wrong or is the sample not that random?
The sample is not truly random. The two-stage sampling method causes too
few blocks to have exactly one row chosen from them, and too many to have
either 0 or 2+ rows chosen from them.
When values in the same block are likely to be equal, then it finds too
many duplicates because it too often picks two rows from a single block.
See analysis here:
If we assume all the blocks have the same tuple density, then it is easy to
correct this. But without that assumption of constant tuple density, I
don't know how to get a truly random sample while still skipping most of
the table.
Cheers,
Jeff
From | Date | Subject | |
---|---|---|---|
Next Message | Brendan Jurd | 2015-06-19 20:01:52 | Re: Tab completion for TABLESAMPLE |
Previous Message | Andres Freund | 2015-06-19 19:44:04 | Re: 9.5 release notes |