Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Craig Milhiser <craig(at)milhiser(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date: 2024-10-14 09:16:11
Message-ID: 3a1ac6f1-ac30-4ed4-8d0d-beb8a5aca7e5@gmail.com
Lists: pgsql-bugs

On 10/14/24 13:26, Tom Lane wrote:
> Andrei Lepikhov <lepihov(at)gmail(dot)com> writes:
>> My explanation (correct if I'm wrong):
>> OUTER JOINs allow NULLs to be in a hash table. At the same time, a hash
>> value for NULL is 0, and it goes to the batch==0.
>> If batch number 0 gets overfilled, the
>> ExecParallelHashIncreaseNumBatches routine attempts to increase the
>> number of batches - but nothing happens. The initial batch is still too
>> big, and the number of batches doubles up to the limit.
>
> Interesting point. If memory serves (I'm too tired to actually look)
> the planner considers the statistical most-common-value when
> estimating whether an unsplittable hash bucket is likely to be too
> big. It does *not* think about null values ... but it ought to.
As I see it, it is just an oversight in the resizing logic: the estimated_size value is never updated for batch 0 - I think because it doesn't matter for that batch, which by definition can't be treated as exhausted. Because of that, parallel HashJoin doesn't detect the extreme skew caused by duplicates in this batch. NULLs are just our (bad) luck - they correspond to hash value 0 and fall into this batch.
See the attachment for a sketch of the solution.
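
To make the mechanism concrete, here is a rough reproduction sketch (table names and sizes are mine, not the reporter's original query, and may need tuning to trigger the behaviour on a given installation). The preserved side of the LEFT JOIN carries mostly NULL join keys and is the smaller relation, so the planner builds the hash table on it and runs the join as a (Parallel) Hash Right Join, which has to keep the NULL-keyed tuples:

  -- Hypothetical reproduction sketch: the NULL-keyed build-side tuples all
  -- get hash value 0, so batch 0 stays oversized no matter how often the
  -- number of batches doubles.
  CREATE TABLE nullish (id int, payload text);
  INSERT INTO nullish
    SELECT NULL, repeat('x', 100) FROM generate_series(1, 1000000);
  CREATE TABLE wide (id int, payload text);
  INSERT INTO wide
    SELECT g, repeat('y', 100) FROM generate_series(1, 10000000) g;
  ANALYZE nullish, wide;

  SET work_mem = '4MB';
  EXPLAIN (ANALYZE)
  SELECT count(*) FROM nullish LEFT JOIN wide USING (id);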

>
> However, this does not explain why PHJ would be more subject to
> the problem than non-parallel HJ.
Good question! I rarely touch this part of the code and maybe don't see
the whole picture. But as I see it, HJ is designed differently: its
repartitioning machinery is based on the overall hash table size and the
number of tuples, and there is nothing similar to 'batch 0' or parallel
batches. The hash table size is calculated for each batch separately and
can't cause this bug.
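
For comparison, running the same sketch from above with parallelism disabled shows what I mean: the serial hash join notices that doubling the number of batches doesn't move any tuples out of the overfilled batch, stops growing nbatch, and finishes by letting that batch exceed work_mem instead of failing on an allocation.

  -- Same hypothetical reproduction, forced to run serially: the non-parallel
  -- hash join gives up on repartitioning when doubling doesn't help and
  -- completes over budget rather than erroring out.
  SET max_parallel_workers_per_gather = 0;
  EXPLAIN (ANALYZE)
  SELECT count(*) FROM nullish LEFT JOIN wide USING (id);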

BTW, can we also resolve here the long-standing corner case with "invalid
DSA memory alloc request size" [1]? Just because we now have a clear
reproduction ...

[1]
https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru

--
regards, Andrei Lepikhov

Attachment Content-Type Size
0001-Consider-extreme-skew-in-batch-0-during-Parallel-Has.patch text/x-patch 1.4 KB
