From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, David Hinkle <hinkle(at)cipafilter(dot)com> |
Subject: | Re: Avoiding OOM in a hash join with many duplicate inner keys |
Date: | 2017-02-16 20:56:49 |
Message-ID: | CA+TgmoZCFcNs+BuTziv+zY=MAZ2yqtR=0Kq5305s4QaRxCOqiw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I initially thought about driving the shutoff strictly from the estimate
>>> of the MCV frequency, without involving the more general ndistinct
>>> computation that estimate_hash_bucketsize does. I'm not sure how much
>>> that would do for your concern, but at least the MCV frequency doesn't
>>> involve quite as much extrapolation as ndistinct.
>
>> Hmm, so we could do something like: if the estimated frequency of the
>> least-common MCV is enough to make one bucket overflow work_mem, then
>> don't use a hash join? That would still be prone to some error (in
>> both directions, really) but it seems less likely to spit out
>> completely stupid results than relying on ndistinct, which never gets
>> very big even in a 10TB table.
>
> No, it'd be the *most* common MCV, because we're concerned about the
> worst-case (largest) bucket size. But that's good, really, because the
> highest MCV frequency will be the one we have most statistical
> confidence in. There's generally a whole lot of noise in the tail-end
> MCV numbers.
Oh, right. That's reassuring, as it seems like it has a much better
chance of actually being right.
> Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
> stats. That would be an expected case if the column is believed unique,
> and it's probably a better fallback behavior when we simply don't have
> stats. With the ndistinct-based rule, we'd be shutting off hashjoin
> almost always when we don't have stats. Given how long it took us
> to recognize this problem, that's probably the wrong default.
Right.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | David Christensen | 2017-02-16 20:58:57 | [PATCH] Add pg_disable_checksums() and supporting infrastructure |
Previous Message | Tom Lane | 2017-02-16 20:51:20 | Re: Avoiding OOM in a hash join with many duplicate inner keys |