Re: BUG #18334: Segfault when running a query with parallel workers

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Marcin Barczyński <mba(dot)ogolny(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18334: Segfault when running a query with parallel workers
Date: 2024-05-24 00:45:14
Message-ID: CA+hUKG+7KA6wQGx4yFBNj5KaTooErV2Ov1+m_ers4DVZWJ_mKg@mail.gmail.com
Lists: pgsql-bugs

On Thu, May 23, 2024 at 11:59 PM Marcin Barczyński <mba(dot)ogolny(at)gmail(dot)com> wrote:
> (gdb) print *segment_map
> $4 = {segment = 0x56134dfa2dd8, mapped_address = 0x7f309faf4000 "",
> header = 0x7f309faf4000, fpm = 0x7f309faf4038, pagemap =
> 0x7f309faf4480}
>
> (gdb) print pageno
> $5 = 196979

Hmm. Page 196979 is an offset of around 769MB within the segment
(pages here are 4k). What does segment_map->segment->mapped_size
show? It's OK for the pagemap to contain zeroes, but it should
contain non-zero values for pages that contain the start of an
allocated object. The actual dsa_pointer has been optimised out but
should be visible from frame #1 as batch->chunks. I think its upper
24 bits should contain 13 (the element of area->segment_maps that
seems to correspond to the above), and its lower 40 bits should
contain that offset of ~769MB.
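
To make that arithmetic concrete, here is a minimal sketch of the decoding
described above, assuming only the figures quoted in this message: 4k pages,
a 24-bit segment index in the upper bits and a 40-bit offset in the lower
bits. The function names are hypothetical helpers for illustration, not
PostgreSQL APIs, and the exact offset within the page is unknown, so the
result is only the page-aligned lower bound of what batch->chunks should hold.

```python
# Assumed layout, taken from the numbers quoted in this message.
PAGE_SIZE = 4096                       # "pages here are 4k"
OFFSET_WIDTH = 40                      # lower 40 bits: offset within the segment
OFFSET_MASK = (1 << OFFSET_WIDTH) - 1


def page_to_offset(pageno):
    """Byte offset of the first byte of a page within its segment."""
    return pageno * PAGE_SIZE


def encode_pointer(segment_index, offset):
    """Pack a segment index and an offset into one 64-bit value."""
    return (segment_index << OFFSET_WIDTH) | (offset & OFFSET_MASK)


def decode_pointer(dp):
    """Split a 64-bit value into (segment index, offset, page number)."""
    offset = dp & OFFSET_MASK
    return dp >> OFFSET_WIDTH, offset, offset // PAGE_SIZE


if __name__ == "__main__":
    offset = page_to_offset(196979)
    print(f"offset = {offset} bytes ({offset / 2**20:.1f} MiB)")
    # Page-aligned value expected if the segment index is 13.
    print(f"expected batch->chunks >= {encode_pointer(13, offset):#x}")
```

Feeding the value that gdb prints for batch->chunks in frame #1 into
decode_pointer() should give segment index 13 and page 196979 if the pointer
and the segment_map above agree.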

The things that are unusually high so far in your emails are worker
count and work_mem, so that it can make quite large hash tables, in
your case up to 13GB. Perhaps there is a silly arithmetic/type
problem around large numbers somewhere (perhaps somewhere near 4GB+
segments, but I don't expect segment #13 to be very large IIRC). But
then that would fail more often I think... It seems to be
rare/intermittent, and yet you don't have any batching or re-bucketing
in your problem (nbatch and nbuckets have their original values), so a
lot of the more complex parts of the PHJ code are not in play here.
Hmm.

I wondered if the tricky edge case where a segment gets unmapped and
then remapped in the same slot could be leading to segment
confusion. That does involve a bit of memory order footwork. What
CPU architecture is this? But alas I can't come up with any case
where that could go wrong even if there is an unknown bug in that
area, because the no-rebatching, no-rebucketing case doesn't free
anything until the end when it frees everything (ie it never frees
something and then allocates, a requirement for slot re-use).
