From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-06-03 21:10:21
Message-ID: CAAKRu_YuQ4gA7XzrH60BbevpjBXZJ_=wS++7DmWtkgG3BBQwfw@mail.gmail.com
Lists: pgsql-hackers
On Sun, May 19, 2019 at 4:07 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> > On Thu, May 16, 2019 at 3:22 PM Thomas Munro
> > <thomas(dot)munro(at)gmail(dot)com> wrote:
> >> Admittedly I don't have a patch, just a bunch of handwaving. One
> >> reason I haven't attempted to write it is because although I know how
> >> to do the non-parallel version using a BufFile full of match bits in
> >> sync with the tuples for outer joins, I haven't figured out how to do
> >> it for parallel-aware hash join, because then each loop over the outer
> >> batch could see different tuples in each participant. You could use
> >> the match bit in HashJoinTuple header, but then you'd have to write
> >> all the tuples out again, which is more IO than I want to do. I'll
> >> probably start another thread about that.
> >
> > Could you explain more about the implementation you are suggesting?
> >
> > Specifically, what do you mean "BufFile full of match bits in sync with
> > the tuples for outer joins?"
>
> First let me restate the PostgreSQL terminology for this stuff so I
> don't get confused while talking about it:
>
> * The inner side of the join = the right side = the side we use to
> build a hash table. Right and full joins emit inner tuples when there
> is no matching tuple on the outer side.
>
> * The outer side of the join = the left side = the side we use to
> probe the hash table. Left and full joins emit outer tuples when
> there is no matching tuple on the inner side.
>
> * Semi and anti joins emit exactly one instance of each outer tuple if
> there is/isn't at least one match on the inner side.
>
> We have a couple of relatively easy cases:
>
> * Inner joins: for every outer tuple, we try to find a match in the
> hash table, and if we find one we emit a tuple. To add looping
> support, if we run out of memory when loading the hash table we can
> just proceed to probe the fragment we've managed to load so far, and
> then rewind the outer batch, clear the hash table and load in the next
> work_mem-sized fragment and do it again... rinse and repeat until
> we've eventually processed the whole inner batch. After we've
> finished looping, we move on to the next batch.
>
> * For right and full joins ("HJ_FILL_INNER"), we also need to emit an
> inner tuple for every tuple that was loaded into the hash table but
> never matched. That's done using a flag HEAP_TUPLE_HAS_MATCH in the
> header of the tuples of the hash table, and a scan through the whole
> hash table at the end of each batch to look for unmatched tuples
> (ExecScanHashTableForUnmatched()). To add looping support, that just
> has to be done at the end of every inner batch fragment, that is,
> after every loop.
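
And if I am reading the HJ_FILL_INNER part right, the per-fragment
unmatched scan slots into the fragment loop of that toy roughly like
this, with a plain flag array standing in for HEAP_TUPLE_HAS_MATCH and
for ExecScanHashTableForUnmatched():

    /* one flag per inner tuple in the current fragment */
    int matched[FRAGMENT] = {0};

    for (int o = 0; o < n_outer; o++)
        for (int i = start; i < end; i++)
            if (outer[o] == inner[i])
            {
                matched[i - start] = 1;     /* remember the inner match */
                printf("emit join: outer %d, inner %d\n",
                       outer[o], inner[i]);
            }

    /*
     * End of this fragment: emit NULL-extended rows for the inner tuples
     * that never matched, the toy equivalent of
     * ExecScanHashTableForUnmatched().
     */
    for (int i = start; i < end; i++)
        if (!matched[i - start])
            printf("emit inner %d with NULLs for the outer side\n",
                   inner[i]);
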
>
> And now for the cases that need a new kind of match bit, as far as I can
> see:
>
> * For left and full joins ("HJ_FILL_OUTER"), we also need to emit an
> outer tuple for every tuple that didn't find a match in the hash
> table. Normally that is done while probing, without any need for
> memory or match flags: if we don't find a match, we just spit out an
> outer tuple immediately. But that simple strategy won't work if the
> hash table holds only part of the inner batch. Since we'll be
> rewinding and looping over the outer batch again for the next inner
> batch fragment, we can't yet say if there will be a match in a later
> loop. But the later loops don't know on their own either. So we need
> some kind of cumulative memory between loops, and we only know which
> outer tuples have a match after we've finished all loops. So there
> would need to be a new function ExecScanOuterBatchForUnmatched().
>
> * For semi joins, we need to emit exactly one outer tuple whenever
> there is one or more match on the inner side. To add looping support,
> we need to make sure that we don't emit an extra copy of the outer
> tuple if there is a second match in another inner batch fragment.
> Again, this implies some kind of memory between loops, so we can
> suppress later matches.
>
> * For anti joins, we need to emit an outer tuple whenever there is no
> match. To add looping support, we need to wait until we've seen all
> the inner batch fragments before we know that a given outer tuple has
> no match, perhaps with the same new function
> ExecScanOuterBatchForUnmatched().
>
> So, we need some kind of inter-loop memory, but we obviously don't
> want to create another source of unmetered RAM gobbling. So one idea
> is a BufFile that has one bit per outer tuple in the batch. In the
> first loop, we just stream out the match results as we go, and then
> somehow we OR the bitmap with the match results in subsequent loops.
> After the last loop, we have a list of unmatched tuples -- just scan
> it in lock-step with the outer batch and look for 0 bits.
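
For what it's worth, the bookkeeping for that is pretty small; here is
a self-contained C toy of the bit-per-outer-tuple idea, with a byte
array standing in for the BufFile (all names and numbers are made up):

#include <stdio.h>
#include <string.h>

#define N_OUTER 10          /* tuples in this outer batch */

static unsigned char matchbits[(N_OUTER + 7) / 8];

static void
set_match(int tupno)
{
    matchbits[tupno / 8] |= 1 << (tupno % 8);
}

static int
has_match(int tupno)
{
    return (matchbits[tupno / 8] >> (tupno % 8)) & 1;
}

int
main(void)
{
    memset(matchbits, 0, sizeof(matchbits));

    /* Loop 1: say outer tuples 1 and 4 matched the first inner fragment. */
    set_match(1);
    set_match(4);

    /*
     * Loop 2: say tuples 4 and 7 matched the second fragment.  Setting the
     * bits again is, in effect, OR-ing this loop's results into the bitmap.
     */
    set_match(4);
    set_match(7);

    /*
     * After the last loop, scan the bitmap in lock-step with the outer
     * batch.  The 0 bits are the outer tuples that never matched: emit
     * them NULL-extended for left/full joins, emit them for anti joins,
     * and skip them for semi joins.
     */
    for (int tupno = 0; tupno < N_OUTER; tupno++)
        if (!has_match(tupno))
            printf("outer tuple %d never found a match\n", tupno);

    return 0;
}
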
>
> Unfortunately that bits-in-order scheme doesn't work for parallel
> hash, where the SharedTuplestore tuples seen by each worker are
> non-deterministic. So perhaps in that case we could use the
> HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
> the whole outer batch back out each time through the loop. That'd
> keep the tuples and match bits together, but it seems like a lot of
> IO... Note that parallel hash doesn't support right/full joins today,
> because of some complications about waiting and deadlocks that might
> turn out to be relevant here too, and might be solvable (I should
> probably write about that in another email), but left joins *are*
> supported today so would need to be desupported if we wanted to add a
> loop-based escape valve but not deal with these problems. That
> doesn't seem acceptable, which is why I'm a bit stuck on this point,
> and unfortunately it may be a while before I have time to tackle any
> of that personally.
>
>
There was an off-list discussion at PGCon last week about doing this
hash looping strategy using the bitmap of match bits. The proposal for
the parallel hashjoin problem was to encode tuple-identifying
information in the bitmap, so that each worker can mark an outer tuple
as matched while it processes a given inner-side chunk; then, at the
end of the scan of the outer side, the workers' bitmaps would be OR'd
together to give a single view of which outer tuples from that
iteration are still unmatched.
I was talking to Jeff Davis about this on Saturday, and he felt that
there might be a way to solve the problem differently if we thought of
the left join case as performing an inner join and an antijoin
instead.
Riffing on this idea a bit, I started trying to write a patch that
would basically emit a tuple if it matches and write the tuple out to
a file if it does not match. Then, after iterating through the outer
batch once for the first inner chunk, the tuples that do not yet have
a match are the only ones that still need to be joined against the
remaining inner chunks. So, instead of iterating through the outer
side's original batch file again, use the unmatched-outer-tuples file
to do the join against the next chunk, and repeat this for every chunk.
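
A rough self-contained C toy of what I mean, with arrays standing in
for the batch files (all names and sizes are made up):

#include <stdio.h>

#define CHUNK 2             /* pretend only 2 inner tuples fit in work_mem */

int
main(void)
{
    int inner[] = {1, 2, 3, 2, 5};      /* join keys of the inner batch */
    int outer[] = {2, 3, 7, 9};         /* join keys of the outer batch */
    int n_inner = 5;
    int n_outer = 4;

    int unmatched[4];       /* stand-in for the unmatched-tuples file */
    int n_unmatched;

    for (int start = 0; start < n_inner; start += CHUNK)
    {
        int end = (start + CHUNK < n_inner) ? start + CHUNK : n_inner;

        n_unmatched = 0;
        for (int o = 0; o < n_outer; o++)
        {
            int found = 0;

            for (int i = start; i < end; i++)
                if (outer[o] == inner[i])
                {
                    found = 1;
                    printf("emit join: outer %d, inner %d\n",
                           outer[o], inner[i]);
                }
            if (!found)
                unmatched[n_unmatched++] = outer[o];    /* spill it */
        }

        /* Only the tuples with no match so far probe the next chunk. */
        for (int o = 0; o < n_unmatched; o++)
            outer[o] = unmatched[o];
        n_outer = n_unmatched;
    }

    /* Whatever is left never matched any chunk: the antijoin half. */
    for (int o = 0; o < n_outer; o++)
        printf("emit outer %d with NULLs for the inner side\n", outer[o]);

    return 0;
}
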
Could we not do this and avoid using the match bit? In the worst case
(if nothing ever matches), you would have to write out all of the
tuples on the outer side nchunks times, where a chunk is the
work_mem-sized piece of the inner side that gets loaded into the
hashtable.
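
To put rough, made-up numbers on it:

    outer batch  = 10M tuples x 100 bytes, roughly 1 GB
    nchunks      = 4

    worst case (nothing ever matches):
        extra outer-side writes = roughly nchunks x 1 GB = 4 GB
    best case (everything matches the first chunk):
        extra outer-side writes = roughly 0

For comparison, a bit-per-outer-tuple bitmap for the same batch is
about 10M bits, or 1.25 MB, regardless of the match rate, while the
unmatched-tuples file gets cheaper the more outer tuples find matches
early.
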
--
Melanie Plageman