Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date: 2015-06-20 15:28:56
Message-ID: 558586B8.8090803@2ndquadrant.com
Lists: pgsql-hackers

On 06/20/2015 04:17 PM, Robert Haas wrote:
> On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I'm currently running some tests on a 3TB TPC-H data set, and I
>> tripped over a pretty bad n_distinct underestimate, causing OOM in
>> HashAgg (which somehow illustrates the importance of the
>> memory-bounded hashagg patch Jeff Davis is working on).
>
> Stupid question, but why not just override it using ALTER TABLE ...
> ALTER COLUMN ... SET (n_distinct = ...)?

Sure, I'll do that, and it's probably the only way to handle it at the
moment. But I wasn't really sure why we're producing such a poor estimate
in the first place. Even considering how small the sample is, it seemed a
bit too bad, and the proportionality to the statistics target seemed
really strange.
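
Something along these lines, I suppose (l_orderkey is just an example
column here, and the negative value means "fraction of rows" rather than
an absolute count):

    ALTER TABLE lineitem ALTER COLUMN l_orderkey SET (n_distinct = -0.25);
    ANALYZE lineitem;   -- the override only kicks in on the next ANALYZE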

Also, if we could produce a better n_distinct estimate, wouldn't that be
awesome? I'd much rather not force users to resort to the manual override
if possible.

>
> I think it's been discussed quite often on previous threads that you
> need to sample an awful lot of the table to get a good estimate for
> n_distinct. We could support that, but it would be expensive, and it
> would have to be done again every time the table is auto-analyzed.
> The above syntax supports nailing the estimate to either an exact
> value or a percentage of the table, and I'm not sure why that isn't
> good enough.

Not really. Using a larger sample would certainly help in most cases, but
in this case the main problem is that the sample is not as random as
needed. It contains way more duplicate values than it should, which then
utterly confuses the estimator, because it assumes a random sample of
rows. This happens because the column values are correlated with the
physical position in the table, so the duplicates end up packed onto the
same blocks and sampling multiple rows from each block picks them up.
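
(FWIW, IIRC the estimator used in analyze.c is the Haas-Stokes one:

    n_distinct = n*d / (n - f1 + f1*n/N)

where n is the sample size, N the total number of rows, d the number of
distinct values in the sample and f1 the number of values seen exactly
once. The extra duplicates drag both d and f1 down, and with a small f1
the estimate collapses towards d, i.e. a massive underestimate.)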

I've lobotomized the sampling a bit to really produce a random set of
blocks first, and that produces way better estimates:

statistics target     current estimate        random sampler
------------------------------------------------------------------
              100       429491  (10000x)      334430766  (14x)
             1000      4240418   (1000x)      439010499  (10x)

Also, the number of sampled blocks is not that different. With target
100, the current sampler reads ~2900 blocks, while a completely random
sampler uses 3000 blocks. So, where's the benefit?
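
For anyone who wants to repeat the comparison, checking the stored
estimate against the actual value is simple enough (l_orderkey again just
as an example column, and the count(DISTINCT) is of course rather
expensive on a 3TB data set):

    -- estimate stored by the last ANALYZE
    SELECT n_distinct FROM pg_stats
     WHERE tablename = 'lineitem' AND attname = 'l_orderkey';

    -- actual number of distinct values
    SELECT count(DISTINCT l_orderkey) FROM lineitem;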

I'm sure there are counter-examples, but that's true for all estimators.
I do realize the random sampler assumes the tuple density is about the
same on all blocks (so that it can sample the blocks directly), but ISTM
the current block sampler has to make basically the same assumption in
order to skip some of the blocks.
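
It's easy enough to check how uniform the density actually is, e.g. by
counting tuples per block using the block number from ctid (just an
illustration of the assumption, not what the sampler does internally):

    SELECT (ctid::text::point)[0] AS blkno, count(*) AS ntuples
      FROM lineitem
     GROUP BY 1
     ORDER BY 2 DESC
     LIMIT 10;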

Have to read the previous threads on this topic I guess ...

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
