From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-committers <pgsql-committers(at)postgresql(dot)org>
Subject: Re: pgsql: Add parallel-aware hash joins.
Date: 2018-01-03 12:06:38
Message-ID: CAEepm=34PDuR69kfYVhmZPgMdy8pSA-MYbpesEN1SR+2oj3Y+w@mail.gmail.com
Lists: pgsql-committers pgsql-hackers
On Wed, Jan 3, 2018 at 2:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
>> I mean that ExecChooseHashTableSize() estimates the hash table size like this:
>> inner_rel_bytes = ntuples * tupsize;
>
>> ... but then at execution time, in the Parallel Hash case, we do
>> memory accounting not in tuples but in chunks. The various
>> participants pack tuples into 32KB chunks, and they trigger an
>> increase in the number of batches when the total size of all chunks
>> happens to exceed the memory budget. In this case they do so
>> unexpectedly, due to extra overhead at execution time that the
>> planner didn't account for. Since we happened to be close to the
>> threshold, in this case between choosing 8 and 16 batches, we can
>> get it wrong and have to increase nbatch at execution time.
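A toy sketch of that accounting difference (everything here is invented
for illustration except the 32KB chunk size, which matches
HASH_CHUNK_SIZE; this is not PostgreSQL code):

```python
HASH_CHUNK_SIZE = 32 * 1024  # matches the real 32KB chunk size

def planner_estimate(ntuples, tupsize):
    # ExecChooseHashTableSize()-style estimate: bare tuples, no chunk overhead
    return ntuples * tupsize

def executor_usage(tuples_loaded_per_worker, tupsize):
    # Each participant packs tuples into whole 32KB chunks; a final
    # partially filled chunk still counts in full against the budget.
    tuples_per_chunk = HASH_CHUNK_SIZE // tupsize
    total = 0
    for n in tuples_loaded_per_worker:
        chunks = -(-n // tuples_per_chunk)  # ceiling division
        total += chunks * HASH_CHUNK_SIZE
    return total

print(planner_estimate(20000, 40))             # 800000
print(executor_usage([7000, 7000, 6000], 40))  # 851968: over the estimate
```

The per-worker split is timing-dependent, which is why the overshoot,
and hence the batch count, can vary from run to run.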
>
> If that's the issue, why doesn't the test fail every time on affected
> platforms? There shouldn't be anything nondeterministic about the
> number or size of tuples going into the hash table?
>
>> ... You get a
>> larger size if more workers manage to load at least one tuple, due to
>> their final partially filled chunk.
>
> Hm. That could do it, except it doesn't really account for the observed
> result that slower single-processor machines seem more prone to the
> bug. Surely they should be less likely to get multiple workers activated.
I can reproduce the instability here with an -m32 build and this:
create table simple as
  select generate_series(1, 20000) AS id,
         'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
alter table simple set (parallel_workers = 2);
analyze simple;
set parallel_setup_cost = 0;
set work_mem = '96kB';
explain analyze select count(*) from simple r join simple s using (id);
It plans for 8 batches, and then usually, but not always, goes to 16
at execution time depending on timing. It doesn't happen for me with
128kB (the setting used in the regression test), but I think the
affected BF machines are all 32-bit systems that have MAXIMUM_ALIGNOF
8 and therefore use a bit more space, whereas my machines have
MAXIMUM_ALIGNOF 4 in a 32-bit build. That would explain the
different location of this unstable border. We could probably fix the
failures by simply increasing work_mem out of that zone, but I'm
hoping to fix the problem in a more principled way. More soon.
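The alignment effect can be sketched like this (maxalign mimics the
MAXALIGN() rounding; the raw tuple size is made up for illustration):

```python
def maxalign(size, alignof):
    # Round size up to the next multiple of alignof, like the MAXALIGN() macro
    return (size + alignof - 1) & ~(alignof - 1)

raw = 42  # hypothetical unaligned tuple size in bytes
print(maxalign(raw, 4))  # 44 with MAXIMUM_ALIGNOF 4
print(maxalign(raw, 8))  # 48 with MAXIMUM_ALIGNOF 8
```

A few extra bytes per tuple is enough to move the total across a
work_mem boundary that sits close to the planner's estimate.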
> BTW, I'm seeing a few things that look bug-like about
> ExecParallelHashTuplePrealloc, for instance why does it use just
> "size" to decide if space_allowed is exceeded but if not then add the
> typically-much-larger value "want + HASH_CHUNK_HEADER_SIZE" to
> estimated_size. That clearly can allow estimated_size to get
> significantly past space_allowed --- if it's not a bug, it at least
> deserves a comment explaining why not.
Right. Fixed in the attached.
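A toy model of the hazard (invented sizes, not the real
ExecParallelHashTuplePrealloc logic): checking the budget with one
value but charging a larger one lets the estimate overshoot.

```python
HASH_CHUNK_HEADER_SIZE = 16  # made-up stand-in for the real constant
space_allowed = 100_000
estimated_size = 0

def prealloc(size, want):
    global estimated_size
    # Buggy pattern: test the budget with the small 'size' ...
    if estimated_size + size > space_allowed:
        return False  # caller would increase nbatch instead
    # ... but account for the much larger 'want + header'
    estimated_size += want + HASH_CHUNK_HEADER_SIZE
    return True

while prealloc(size=64, want=32 * 1024):
    pass
print(estimated_size)                   # 131136
print(estimated_size > space_allowed)   # True: well past the budget
```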
> Another angle, which does not
> apply to this test case but seems like a bug for real usage, is that
> ExecParallelHashTuplePrealloc doesn't account correctly for tuples wider
> than HASH_CHUNK_THRESHOLD.
Right. I'll address that separately.
> I'm also wondering why the non-parallel path seems to prefer to allocate
> in units of HASH_CHUNK_SIZE + HASH_CHUNK_HEADER_SIZE while the parallel
> path targets allocations of exactly HASH_CHUNK_SIZE,
That is intentional: dsa.c sucks at allocating 32KB + a tiny bit
because it works in 4KB pages for large allocations, so I wanted to
make HASH_CHUNK_SIZE the total size that arrives into dsa_allocate().
The non-parallel path has similar problems on some libc
implementations, as we discussed over here:
https://www.postgresql.org/message-id/flat/29770.1511495642%40sss.pgh.pa.us
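The page-rounding cost is easy to see with toy arithmetic (the header
size is a made-up stand-in for HASH_CHUNK_HEADER_SIZE):

```python
PAGE_SIZE = 4096             # dsa.c works in 4KB pages for large allocations
HASH_CHUNK_SIZE = 32 * 1024
HEADER = 16                  # hypothetical header size, for illustration

def pages_needed(nbytes):
    return -(-nbytes // PAGE_SIZE)  # ceiling division

print(pages_needed(HASH_CHUNK_SIZE))           # 8: fits the pages exactly
print(pages_needed(HASH_CHUNK_SIZE + HEADER))  # 9: a nearly empty extra page
```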
> and why there's such
> inconsistency in whether tuples of exactly HASH_CHUNK_THRESHOLD bytes
> are treated as "big" or not.
Right, that's inconsistent. Fixed in the attached.
--
Thomas Munro
http://www.enterprisedb.com
Attachment: fix-phj-fencepost.patch (application/octet-stream, 1.6 KB)