Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date: 2015-06-20 16:55:44
Message-ID: 55859B10.2040200@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 06/20/2015 03:01 AM, Jeff Janes wrote:
>
> Hmmm, that's probably true. OTOH correlated columns are not all that
> uncommon (e.g. table storing time-series data etc.), and this blowup
> is quite bad ...
>
>
> True, but we don't know how big of a problem the density-skew
> problem might be (since the current algorithm isn't sensitive to it).
> It might be just as big of a problem. Tom mentioned some ancient
> history in the above mentioned thread that made me think the density
> skew was enough of a problem to motivate the current system.
>
>
> I don't think we need to really assume the density to be constant,
> maybe we can verify that while collecting the sample? I mean, we're
> already reading the pages, so we can track the density, and either
> do the correction or not.
>
>
> Maybe. I don't know how that would work.

I was thinking about something like

(1) compute average tuples per page

(2) generate random sample of 'samplesize' blocks

(3) for each block, check the actual number of tuples on the page and
apply a correction (e.g. if the page holds only half the average
number of tuples, toss a coin and decide whether to really sample it)

(4) repeat until you have sample of the requested size
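In Python-ish pseudo-code, I'm imagining something like this (just a
sketch to make the idea concrete - the names and the acceptance rule are
made up, this is not what analyze.c does):

```python
import random

def density_corrected_sample(tuples_per_block, samplesize, seed=42):
    """tuples_per_block: list with the actual tuple count of each block.
    Returns the block numbers accepted into the sample."""
    rng = random.Random(seed)
    # (1) average tuples per page (in reality this would come from
    # reltuples / relpages, i.e. the previous estimate)
    avg = sum(tuples_per_block) / len(tuples_per_block)
    chosen, seen = [], set()
    # (4) repeat until the sample has the requested size
    while len(chosen) < samplesize:
        # (2) pick a random block
        b = rng.randrange(len(tuples_per_block))
        if b in seen:
            continue
        seen.add(b)
        if len(seen) == len(tuples_per_block) and len(chosen) < samplesize:
            break  # ran out of blocks; give up with a smaller sample
        # (3) toss a coin: a block with half the average tuple count is
        # kept only half the time; a denser-than-average block always
        # passes, which is exactly the weak spot with high-density
        # blocks I mention below
        if rng.random() < tuples_per_block[b] / avg:
            chosen.append(b)
    return chosen
```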

The first weak spot is that this requires knowledge of reltuples, which
is something we currently estimate from the sample itself, so it would
have to use the old value, I guess.

Also, I'm not sure this works all that well with blocks of higher than
average density - you'd want to choose them with a higher probability,
but once a block has been selected there's no point in selecting it
again.

> We would have to keep two samples, and dynamically decide which to
> use. And what if the decision is that both density skew is a problem
> and that value clustering is also a problem?

I don't see why we would have to keep two samples? The idea is that
sampling the blocks truly randomly solves the clustering issue, and the
correction I mentioned above solves the density skew problem.

> I wonder if the n_distinct could be tweaked so that it counted any
> given value only once for each block it finds it in? So instead of
> asking "how many times was this value sampled", ask "in how many
> blocks was this value sampled at least once"?

I don't think that's a really good idea. Imagine a column with a very
low cardinality, so that every block you sample has just a very few
distinct values. That's going to get very nasty very soon, especially
when combined with estimators assuming a truly random sample (which is
exactly the issue with the current ndistinct estimator).
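To make that concrete, here's a toy simulation (made-up data, not the
real estimator) of a clustered, low-cardinality column. With per-block
counting, every sampled block contributes a count of 1 per value no
matter how many rows it holds, so the counts no longer sum to the number
of sampled rows - and any estimator assuming a random row sample of size
n then gets internally inconsistent input:

```python
import random
from collections import Counter

ROWS_PER_BLOCK = 10
NBLOCKS = 100
# Clustered, low-cardinality column: each block holds a single value,
# and there are only 3 distinct values in the whole "table".
table = [[b % 3] * ROWS_PER_BLOCK for b in range(NBLOCKS)]

rng = random.Random(1)
sampled = rng.sample(range(NBLOCKS), 30)  # 30 random blocks = 300 rows

# "how many times was this value sampled"
row_counts = Counter(v for b in sampled for v in table[b])
# "in how many blocks was this value sampled at least once"
block_counts = Counter(v for b in sampled for v in set(table[b]))

n = 30 * ROWS_PER_BLOCK
print(sum(row_counts.values()), n)    # 300 == n, consistent
print(sum(block_counts.values()))     # 30, wildly short of n
```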

>
> Also, doesn't Vitter make pretty much the same assumption implicitly,
> otherwise it couldn't skip some of the blocks?
>
>
> Vitter samples an unstructured stream in a single pass, and is
> unbiased. The explicit block sampling is not part of Vitter, it is
> something we bolted on top of it.

Yeah, you're right of course. Sorry for not being quite precise. We're
using a variant of Algorithm S, based on the idea that we know the
number of blocks at the beginning - but that's pretty much the place
where we assume a uniform density of rows per block, no? Because
Algorithm S skips blocks using that assumption implicitly.
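For reference, the block-selection part is essentially Knuth's
Algorithm S - here's a quick sketch (my own Python rendering, not the
actual code). Note that every block ends up selected with the same
probability n/N regardless of how many tuples it actually holds, which
is where the uniform-density assumption sneaks in:

```python
import random

def algorithm_s(N, n, rng=random.Random(7)):
    """Return n block numbers chosen uniformly from range(N).
    Requires the total number of blocks N up front."""
    chosen = []
    for t in range(N):
        # Select block t with probability (still needed)/(still remaining);
        # this forces selection once the remainder equals what's needed,
        # so exactly n blocks always come out.
        if rng.random() * (N - t) < (n - len(chosen)):
            chosen.append(t)
            if len(chosen) == n:
                break
    return chosen
```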

>
> My solution was to just unbolt the block sampling from the top, and let
> it sample the rows (still 300 * stats_target of them) from the whole
> table rather than a random 300 * 100 blocks of the table. On tables of
> the size I was worried about, block sampling was not very helpful
> anyway. Reading 30,000 blocks out of 250,000 blocks led to no
> meaningful IO advantage on my hardware. Any advantage from skipped
> blocks was made up for (and sometimes exceeded) by fouling up the
> read-ahead mechanisms.
>
> With 1000 times more blocks, that probably won't work for you.
> But I do wonder, how much time does it take to read a random 1/10,
> 1/100, 1/1000, 1/10000 of a table of your size, just from an IO
> perspective? How much are we gaining by doing the block sample?

Well, actually I think it would be even more appropriate for very large
tables. With a 2.5TB table, you don't really care whether analyze
collects a 5GB or an 8GB sample; the difference is rather minor compared
to the I/O generated by the other queries etc. The current sample is
already random enough not to work well with read-ahead, and it scans
only a slightly smaller number of blocks.

And if the larger "random" sample results in better plans (especially
plans without OOM) ...

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
