Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-06-04 18:47:46
Message-ID: CAAKRu_b_XVe0LGVJGoQA77pZJ6xs7xsyyQPbQ+GBA8=Ot79i7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 4, 2019 at 5:43 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Jun 3, 2019 at 5:10 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> > I was talking to Jeff Davis about this on Saturday, and, he felt that
> > there might be a way to solve the problem differently if we thought of
> > the left join case as performing an inner join and an antijoin
> > instead.
> >
> > Riffing on this idea a bit, I started trying to write a patch that
> > would basically emit a tuple if it matches and write the tuple out to
> > a file if it does not match. Then, after iterating through the outer
> > batch the first time for the first inner chunk, any tuples which do
> > not yet have a match are the only ones which need to be joined against
> > the other inner chunks. Instead of iterating through the outer side
> > original batch file, use the unmatched outer tuples file to do the
> > join against the next chunk. Repeat this for all chunks.
>
> I'm not sure that I understanding this proposal correctly, but if I am
> then I think it doesn't work in the case where a single outer row
> matches rows in many different inner chunks. When you "use the
> unmatched outer tuples file to do the join against the next chunk,"
> you deny any rows that have already matched the chance to produce
> additional matches.
>
>
Oops! You are totally right.
I will amend the idea:
For each chunk on the inner side, loop through both the original batch
file and the unmatched outer tuples file created for the last chunk.
Emit any matches and write out any unmatched tuples to a new unmatched
outer tuples file.

I think, in the worst case, if no tuples from the outer have a match,
you end up writing out all of the outer tuples for each chunk on the
inner side. However, using the match bit in the tuple header solution
would require this much writing.
Probably the bigger problem is that in this worst case you would also
need to read double the number of outer tuples for each inner chunk.

However, in the best case it seems like it would be better than the
match bit/write everything from the outer side out solution.

--
Melanie Plageman

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2019-06-04 19:08:24 Re: Avoiding hash join batch explosions with extreme skew and weird stats
Previous Message Melanie Plageman 2019-06-04 18:33:18 Re: Sort support for macaddr8