From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Craig Milhiser <craig(at)milhiser(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date: 2024-10-14 09:16:11
Message-ID: 3a1ac6f1-ac30-4ed4-8d0d-beb8a5aca7e5@gmail.com
Lists: pgsql-bugs
On 10/14/24 13:26, Tom Lane wrote:
> Andrei Lepikhov <lepihov(at)gmail(dot)com> writes:
>> My explanation (correct if I'm wrong):
>> OUTER JOINs allow NULLs to be in a hash table. At the same time, a hash
>> value for NULL is 0, and it goes to the batch==0.
>> If batch number 0 gets overfilled, the
>> ExecParallelHashIncreaseNumBatches routine attempts to increase the
>> number of batches - but nothing happens. The initial batch is still too
>> big, and the number of batches doubles up to the limit.
>
> Interesting point. If memory serves (I'm too tired to actually look)
> the planner considers the statistical most-common-value when
> estimating whether an unsplittable hash bucket is likely to be too
> big. It does *not* think about null values ... but it ought to.
As I see it, it is just an oversight in the resizing logic: batch 0
never updates its estimated_size value - presumably because it doesn't
matter for that batch, which by definition can't be treated as
exhausted. Because of that, parallel HashJoin doesn't detect extreme
skew caused by duplicates in this batch. NULLs are just bad luck here -
they correspond to hash value 0 and therefore all land in batch 0.
See the attachment for a sketch of the solution.
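To make the idea concrete, here is a minimal standalone model of the
decision I have in mind (simplified types and hypothetical names such as
SketchBatch; the real logic lives in ExecParallelHashIncreaseNumBatches(),
and the attached patch is the authoritative version):

#include <stdbool.h>
#include <stddef.h>

typedef struct SketchBatch
{
	size_t		estimated_size;	/* bytes expected in this batch */
	size_t		old_ntuples;	/* tuples before repartitioning */
	size_t		ntuples;		/* tuples after repartitioning */
} SketchBatch;

/*
 * Return true if doubling nbatch again cannot help: some batch that still
 * exceeds the memory limit received every tuple of its parent batch, so
 * its hash values cannot be subdivided any further.
 */
static bool
extreme_skew_detected(SketchBatch *batches, int nbatch, int old_nbatch,
					  size_t space_allowed)
{
	for (int i = 0; i < nbatch; i++)
	{
		/* Only batches that still exceed the limit are interesting. */
		if (batches[i].estimated_size <= space_allowed)
			continue;

		/*
		 * Did this batch inherit all tuples of its parent?  For batch 0
		 * the "parent" is batch 0 itself, which is exactly the NULL-keys
		 * case: hash value 0 always maps there.  This check can only
		 * fire for batch 0 if its counters are kept up to date, which is
		 * the part that seems to be missing today.
		 */
		if (batches[i].ntuples == batches[i % old_nbatch].old_ntuples)
			return true;
	}
	return false;
}

The point is only that the parent-batch comparison covers batch 0 as well,
provided its size accounting is actually maintained during repartitioning.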
>
> However, this does not explain why PHJ would be more subject to
> the problem than non-parallel HJ.
Good question! I rarely touch this part of the code and may not see the
whole picture. But as I understand it, non-parallel HJ is designed
differently: its repartitioning machinery is driven by the overall hash
table size and the tuple counts of each split, and it has nothing
analogous to 'batch 0' or parallel batches. The hash table size is
recalculated for each batch, so the same bug can't occur there - see the
sketch below.
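For comparison, the serial path gives up on repartitioning as soon as a
split moves either none or all of the in-memory tuples. A tiny
self-contained model of that decision (the names nfreed/ninmemory follow
ExecHashIncreaseNumBatches(), but this is an illustration, not a quote of
the real code):

#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified model of the serial hash join decision: after splitting the
 * in-memory hash table into more batches, if either nothing or everything
 * was written out, the oversized group must consist of identical hash
 * values, so growing nbatch further cannot help.
 */
static bool
keep_growing_nbatch(size_t ninmemory, size_t nfreed)
{
	if (nfreed == 0 || nfreed == ninmemory)
		return false;			/* duplicates: disable further expansion */
	return true;
}

So a pile of identical hash values (including NULL keys) stops the
doubling right away there, instead of letting nbatch explode.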
BTW, can we also resolve the long-standing corner case with "invalid
DSA memory alloc request size" [1] here, now that we have a clear
reproduction ...
[1]
https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru
--
regards, Andrei Lepikhov
Attachment: 0001-Consider-extreme-skew-in-batch-0-during-Parallel-Has.patch (text/x-patch, 1.4 KB)