From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Marcin Barczyński <mba(dot)ogolny(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18334: Segfault when running a query with parallel workers |
Date: | 2024-05-24 00:45:14 |
Message-ID: | CA+hUKG+7KA6wQGx4yFBNj5KaTooErV2Ov1+m_ers4DVZWJ_mKg@mail.gmail.com |
Lists: | pgsql-bugs |
On Thu, May 23, 2024 at 11:59 PM Marcin Barczyński <mba(dot)ogolny(at)gmail(dot)com> wrote:
> (gdb) print *segment_map
> $4 = {segment = 0x56134dfa2dd8, mapped_address = 0x7f309faf4000 "",
> header = 0x7f309faf4000, fpm = 0x7f309faf4038, pagemap =
> 0x7f309faf4480}
>
> (gdb) print pageno
> $5 = 196979
Hmm. Page 196979 is an offset of around 769MB within the segment
(pages here are 4k). What does segment_map->segment->mapped_size
show? It's OK for the pagemap to contain zeroes, but it should
contain non-zero values for pages that contain the start of an
allocated object. The actual dsa_pointer has been optimised out but
should be visible from frame #1 as batch->chunks. I think its upper
24 bits should contain 13 (the element of area->segment_maps that
seems to correspond to the above), and its lower 40 bits should
contain that offset of ~769MB.
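To make that arithmetic concrete, here is a rough sketch of how such a pointer would split apart. It is not PostgreSQL code: it assumes a 64-bit dsa_pointer with a 40-bit offset field and 4k pages, as described above, and the helper names are made up for illustration.

```python
# Illustrative sketch only; helper names are hypothetical, not PostgreSQL APIs.
# Assumption: 64-bit dsa_pointer = 24-bit segment index + 40-bit offset.
OFFSET_WIDTH = 40
OFFSET_MASK = (1 << OFFSET_WIDTH) - 1
PAGE_SIZE = 4096  # assumption: 4k pages, as stated above


def page_to_offset(pageno):
    """Byte offset within the segment at which the given page starts."""
    return pageno * PAGE_SIZE


def make_dsa_pointer(segment_index, offset):
    """Pack a segment index and a byte offset into one 64-bit value."""
    assert 0 <= offset <= OFFSET_MASK
    return (segment_index << OFFSET_WIDTH) | offset


def decode_dsa_pointer(dp):
    """Split a packed pointer into (segment index, byte offset)."""
    return dp >> OFFSET_WIDTH, dp & OFFSET_MASK


if __name__ == "__main__":
    offset = page_to_offset(196979)
    dp = make_dsa_pointer(13, offset)
    segment_index, decoded_offset = decode_dsa_pointer(dp)
    print(hex(dp), segment_index, decoded_offset, decoded_offset // PAGE_SIZE)
    print(round(offset / (1024 * 1024), 1), "MiB into the segment")
```

If batch->chunks is healthy, shifting it right by 40 bits in gdb (print batch->chunks >> 40) should give 13, and the remaining low bits divided by 4096 should land on or near page 196979.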
The things that are unusually high so far in your emails are worker
count and work_mem, so that it can make quite large hash tables, in
your case up to 13GB. Perhaps there is a silly arithmetic/type
problem around large numbers somewhere (perhaps somewhere near 4GB+
segments, but I don't expect segment #13 to be very large IIRC). But
then that would fail more often I think... It seems to be
rare/intermittent, and yet you don't have any batching or re-bucketing
in your problem (nbatch and nbuckets have their original values), so a
lot of the more complex parts of the PHJ code are not in play here.
Hmm.
I wondered if the tricky edge case where a segment gets unmapped and
then remapped in the same slot could be leading to segment
confusion. That does involve a bit of memory order footwork. What
CPU architecture is this? But alas I can't come up with any case
where that could go wrong even if there is an unknown bug in that
area, because the no-rebatching, no-rebucketing case doesn't free
anything until the end when it frees everything (i.e. it never frees
something and then allocates, which is a requirement for slot re-use).