Re: accounting for memory used for BufFile during hash joins

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: accounting for memory used for BufFile during hash joins
Date: 2019-05-08 00:30:27
Message-ID: CAAKRu_aUirDu2gud_jgqdV73KKycjNXY2GZ=2+FDMeorO6iTDQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Nope, that's not how it works. It's the array of batches that gets
> sliced, not the batches themselves.
>
> It does slightly increase the amount of data we need to shuffle between
> the temp files, because we can't write the data directly to batches in
> "future" slices. But that amplification is capped to ~2.2x (compared to
> the ~1.4x in master) - I've shared some measurements in [1].
>
> [1]
> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>
>
Cool, I misunderstood. I looked at the code again today, and at the email
thread where you measured the "amplification".

In terms of how many times each tuple is written, is it accurate to say that
a tuple can now be spilled three times (in the worst case), whereas before it
could be spilled only twice?

1 - when building the inner-side hashtable, the tuple is spilled to a "slice"
file
2 - (assuming the number of batches was increased) during execution, when a
tuple belonging to a later slice's spill file is found, it is re-spilled to
that slice's spill file
3 - during execution, when reading from its slice file, it is re-spilled
(again) to its batch's spill file
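
To check my understanding of the flow, here is a rough pseudocode sketch of
that worst case; the helper names (slice_file, batch_file, slice_of, batch_of,
current_slice) are made up for illustration and are not from the patch:

    /* (1) build phase: tuple doesn't fit in memory, spill it to its slice file */
    BufFileWrite(slice_file(slice_of(tuple)), tuple, tuple_size);

    /* (2) batches were increased: a tuple read back during execution turns out
     *     to belong to a later slice, so it is re-spilled to that slice's file */
    if (slice_of(tuple) > current_slice)
        BufFileWrite(slice_file(slice_of(tuple)), tuple, tuple_size);

    /* (3) when that slice is processed, the tuple is re-spilled once more,
     *     this time into its batch's file within the slice */
    BufFileWrite(batch_file(batch_of(tuple)), tuple, tuple_size);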

Is it correct that the max number of BufFile structs you will have is equal to
the number of slices plus the number of batches in a slice, because that is
the max number of open BufFiles you would have at a time?
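
To make the arithmetic I have in mind concrete (the numbers are made up):

    nbatch = 1024, batches per slice = 64
    => slices = 1024 / 64 = 16
    => max open BufFiles ~ 16 slice files + 64 batch files = 80
       (vs. 1024 batch files on master)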

By the way, applying the v4 patch on master in an assert-enabled build, I am
tripping some asserts -- starting with

    Assert(!file->readOnly);

in BufFileWrite().

One thing I was a little confused by was the nbatch_inmemory member of the
hashtable. The comment in ExecChooseHashTableSize says that it is determining
the number of batches we can fit in memory. I thought the problem was the
amount of space taken up by the BufFile data structures themselves -- which is
related to the number of open BufFiles you need at a time. The comment in
ExecChooseHashTableSize makes it sound like you are talking about fitting more
than one batch of tuples into memory at a time, but I was under the impression
that you can only fit one batch of tuples in memory at a time.
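
To spell out my (possibly wrong) mental model, I would have expected
nbatch_inmemory to be derived from the per-BufFile overhead relative to
work_mem, roughly like this sketch -- the 8 kB figure and the variable names
are my assumptions, not the patch's:

    /* sketch of my assumption, not the patch code: cap the number of
     * concurrently open per-batch BufFiles so their buffers fit in work_mem */
    size_t  buffile_size = 8192;                   /* assumed per-BufFile buffer */
    long    max_files = work_mem * 1024L / buffile_size;
    int     nbatch_inmemory = 1;

    while (nbatch_inmemory * 2 <= max_files)
        nbatch_inmemory *= 2;                      /* keep it a power of two */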

So, I was stepping through the code with work_mem set to the lower bound, and
in ExecHashIncreaseNumBatches I got confused: hashtable->nbatch_inmemory was 2
for me, and thus nbatch_tmp was also 2, so I didn't meet the condition

    if (nbatch_tmp > hashtable->nbatch_inmemory)

since nbatch_tmp had just been set from hashtable->nbatch_inmemory. As a
result, the number of slices was not increased, which is what I was expecting
to happen. What happens when hashtable->nbatch_inmemory is equal to
nbatch_tmp?
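
In other words, the walk-through looked like this to me (a stripped-down
sketch of what I observed, not the patch verbatim; nslices is a made-up name
for whatever tracks the slice count):

    int nbatch_tmp = hashtable->nbatch_inmemory;   /* == 2 with minimal work_mem */

    if (nbatch_tmp > hashtable->nbatch_inmemory)   /* 2 > 2 is false */
        nslices *= 2;                              /* so this never runs */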

--
Melanie Plageman
