Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date: 2015-06-22 05:47:31
Message-ID: CAMkU=1wOk0WJjyUU0ZboZpdpSenGvhPXUW_K+vciorW9+tGaqg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jun 20, 2015 at 8:28 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

Hi Tomas,

> I've lobotomized the sampling a bit to really produce a random set of
> blocks first, and that produces way better estimates:
>
> statistics target    estimate             random
> -----------------------------------------------------------------
> 100                  429491 (10000x)      334430766 (14x)
> 1000                 4240418 (1000x)      439010499 (10x)
>
> Also, the number of sampled blocks is not that different. With target 100,
> the current sampler reads ~2900 blocks, while a completely random sampler
> uses 3000 blocks. So, where's the benefit?
>

I don't know what you did. The block sampling is already random; unless
there is some overlooked bug, it can't be made more random. Could you post
the patch?
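
To spell out what I mean by "already random": as I understand it, the block
stage is plain selection sampling, so every block has the same chance of
being picked. A rough sketch of that kind of block picker (Python rather
than the actual C, and not the real code):

import random

def sample_block_numbers(total_blocks, sample_size):
    # Selection sampling (Knuth's Algorithm S): walk the blocks in order
    # and keep each one with probability (still needed) / (still remaining),
    # which gives every block an equal chance of ending up in the sample.
    chosen = []
    needed = min(sample_size, total_blocks)
    for blk in range(total_blocks):
        remaining = total_blocks - blk
        if random.random() * remaining < needed - len(chosen):
            chosen.append(blk)
            if len(chosen) == needed:
                break
    return chosen

# e.g. pick 30,000 blocks out of a hypothetical 1,000,000-block relation
blocks = sample_block_numbers(1_000_000, 30_000)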

As I understand it, with a target of 100, it should be sampling exactly
30,000 blocks. Some of those blocks will end up having no rows chosen from
them (just by chance), but the blocks were still read. If a few thousand
of those blocks end up with no tuples in the final sample, that is not done
to save IO; it is just an artifact of how the random sampling works.
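
To illustrate that artifact, here is a rough simulation (Python, with made-up
numbers, not the actual ANALYZE code): read a fixed set of blocks, draw the
row sample uniformly from just those blocks, and count how many of the blocks
that were read contribute nothing to the final sample. The 60 rows per block
is purely an assumption for the example.

import random

# Illustrative numbers only: 30,000 sampled blocks (300 * statistics
# target of 100) and an assumed 60 rows per block.
SAMPLED_BLOCKS = 30_000
ROWS_PER_BLOCK = 60
TARGET_ROWS = 30_000

# Every sampled block is read, whatever happens in the row stage.
rows = [(blk, row) for blk in range(SAMPLED_BLOCKS)
                   for row in range(ROWS_PER_BLOCK)]

# Row stage: a uniform random sample of TARGET_ROWS rows from those blocks
# (the real code uses reservoir sampling; the resulting distribution is the
# same as a plain uniform sample).
sample = random.sample(rows, TARGET_ROWS)

blocks_hit = {blk for blk, _ in sample}
print(f"blocks read: {SAMPLED_BLOCKS}")
print(f"blocks contributing no rows: {SAMPLED_BLOCKS - len(blocks_hit)}")

Under assumptions like these a sizable fraction of the blocks that were read
contributes nothing to the row sample, even though no IO was saved by it.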

Thanks,

Jeff
