Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Jesse Zhang <sbjesse(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, dkimura(at)pivotal(dot)io, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, david(dot)g(dot)kimura(at)gmail(dot)com
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2020-06-24 15:55:11
Message-ID: CAGf+fX7Ayre-nH1JynJd3jnjEkA=bBjbfPXLKPfrw4Ethw5Cpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tomas,

On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra wrote:
>
> Now, a couple comments / questions about the code.
>
>
> nodeHash.c
> ----------
>
>
> 1) MultiExecPrivateHash says this
>
> /*
> * Not subject to skew optimization, so either insert normally
> * or save to batch file if it belongs to another stripe
> */
>
> I wonder what it means to "belong to another stripe". I understand what
> that means for batches, which are identified by batchno computed from
> the hash value. But I thought "stripes" are just work_mem-sized pieces
> of a batch, so I don't quite understand this. Especially when the code
> does not actually check "which stripe" the row belongs to.

I have to concur that "stripe" did inspire a RAID vibe when I heard it,
but it seemed to be a better name than what it replaces

> 3) I'm a bit puzzled about this formula in ExecHashIncreaseNumBatches
>
> childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;
>
> and also about this comment
>
> /*
> * TODO: what to do about tuples that don't go to the child
> * batch or stay in the current batch? (this is why we are
> * counting tuples to child and curbatch with two diff
> * variables in case the tuples go to a batch that isn't the
> * child)
> */
> if (batchno == childbatch)
> childbatch_outgoing_tuples++;
>
> I thought each old batch is split into two new ones, and the tuples
> either stay in the current one, or are moved to the new one - which I
> presume is the childbatch, although I haven't tried to decode that
> formula. So where else could the tuple go, as the comment tried to
> suggest?

True, every old batch is split into two new ones, if you only consider
tuples coming from the batch file that _still belong in there_. i.e.
there are tuples in the old batch file that belong to a future batch. As
an example, if the current nbatch = 8, and we want to expand to nbatch =
16, (old) batch 1 will split into (new) batch 1 and batch 9, but it can
already contain tuples that need to go into (current) batches 3, 5, and
7 (soon-to-be batches 11, 13, and 15).

Cheers,
Jesse

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-06-24 16:09:35 Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary
Previous Message Robert Haas 2020-06-24 15:51:24 Re: extensible options syntax for replication parser?