Re: accounting for memory used for BufFile during hash joins

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: accounting for memory used for BufFile during hash joins
Date: 2019-05-07 04:28:36
Message-ID: CA+hUKG+546WK=ya70QMs-oWE94BhsUKvqjr=ds_X_wkv7fP-2w@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
> >Seems expensive for large numbers of slices -- you need to join the
> >outer batch against each inner slice.
>
> Nope, that's not how it works. It's the array of batches that gets
> sliced, not the batches themselves.

Sorry, I read only the description and not the code, and got confused
about that. So, I see three separate but related problems:

A. Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because other,
more uniformly distributed keys are mixed in with them. We could fix
that with (say) a 95% rule (a sketch of what I mean follows this list).
B. Lack of good alternative execution strategy when the escape valve
is triggered. A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C. Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.
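
To show the shape of the 95% rule I have in mind for problem A, here
is a minimal sketch -- the names and the exact threshold are made up,
this is not the actual executor code:

#include <stdbool.h>

#define SPLIT_SKEW_THRESHOLD 0.95

static bool
split_was_effective(double ntuples_total, double ntuples_largest_child)
{
    if (ntuples_total <= 0)
        return true;            /* nothing to judge; allow growth */

    /* if >= 95% of tuples landed in one child, splitting was futile */
    return ntuples_largest_child / ntuples_total < SPLIT_SKEW_THRESHOLD;
}

Today we only disable growth on a perfect 0%/100% split, so a batch
that is 99% one key and 1% everything else keeps doubling forever.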

> >But I wonder how we'd deal with outer joins, as Tom Lane asked in
> >another thread:
> >
> >https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>
> That seems unrelated - we slice the array of batches, to keep memory
> needed for BufFile under control. The hash table remains intact, so
> there's no issue with outer joins.

Right, sorry, my confusion. I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop. (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file. Googling,
I see that MySQL originally didn't support BNL for outer joins and
then added some match flag propagation thing recently.)
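
To make the match-bit idea concrete, here's roughly what I have in
mind -- purely a sketch, nothing like this exists in the tree: the
inner side is consumed one work_mem-sized chunk at a time, the outer
side is rescanned once per chunk, and one bit per outer tuple records
whether it has matched in any pass so far; only after the final pass
can unmatched outer tuples be emitted null-extended. Whether the bits
live in memory or in a temporary file is a detail -- with a file you'd
OR each pass's bits into the accumulated stream instead:

#include <stdbool.h>

typedef struct BnlMatchBits
{
    unsigned char *bits;        /* one bit per outer tuple, all passes */
    long        nouter;         /* number of outer tuples */
} BnlMatchBits;

/* record that outer tuple 'tupno' found a match in the current pass */
static inline void
bnl_set_match(BnlMatchBits *m, long tupno)
{
    m->bits[tupno / 8] |= (unsigned char) (1 << (tupno % 8));
}

/* after the final pass: did this outer tuple match in any pass? */
static inline bool
bnl_was_matched(const BnlMatchBits *m, long tupno)
{
    return (m->bits[tupno / 8] >> (tupno % 8)) & 1;
}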

> I agree we should relax the 0%/100% split condition, and disable the
> growth sooner. But I think we should also re-evaluate that decision
> after a while - the data set may be correlated in some way, in which
> case we may disable the growth prematurely. It may not reduce memory
> usage now, but it may help in the future.
>
> It's already an issue, but it would be even more likely if we disabled
> growth e.g. with just 5%/95% splits.
>
> FWIW I believe this is mostly orthogonal issue to what's discussed in
> this thread.

But isn't problem A the root cause of problem C, in most cases? There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it. So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing, and then I guess ideally
switch to our alternative strategy (BNL or sort-merge or ...) when we
see that BufFiles are wasting too much work_mem and it's time to try
something else. It seems you don't actually have one of those cases
here, though?
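
For the record, the arithmetic that makes the accounting matter (rough
numbers only: each BufFile keeps one BLCKSZ-sized buffer in memory,
and we hold an inner and an outer file per batch):

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* memory pinned by temp file buffers alone, with nbatch batches */
static long
buffile_overhead(long nbatch)
{
    /* one inner + one outer BufFile per batch, one buffer each */
    return nbatch * 2 * BLCKSZ;
}

At nbatch = 131072 that's already 2GB of buffers before a single tuple
is written, no matter what work_mem says.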

I think we should fix problem A. Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem) when we think that creating
more BufFiles will be futile (not sure exactly what the rule there
should be). And then work on fixing B properly with a good strategy.
Here's a straw-man idea: we could adopt BNL, and then entirely remove
our repartitioning code. If the planner's number of partitions turns
out not to be enough, we'll just handle it using BNL loops.
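
As for the futility rule, maybe something as dumb as this would do for
a start -- the threshold is invented, just to show the shape of the
test:

#include <stdbool.h>

#define BLCKSZ 8192

static bool
nbatch_growth_is_futile(long nbatch, long work_mem_bytes)
{
    /* after doubling: two BufFiles per batch, one BLCKSZ buffer each */
    long        next_overhead = nbatch * 2 * 2 * BLCKSZ;

    /* doubling can't reduce memory use once the buffers alone eat it */
    return next_overhead > work_mem_bytes / 2;
}

That is, once the next doubling's file buffers alone would exceed half
of work_mem, give up on repartitioning and switch strategies.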

> Switching to some other algorithm during execution moves the goal posts
> to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is that not
everything hashable is sortable. So BNL is actually the only solution
I'm aware of for problem B that doesn't involve changing a fundamental
thing about PostgreSQL's data type requirements.

--
Thomas Munro
https://enterprisedb.com
