Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Alex Richman <alexrichman(at)onesignal(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Niels Stevens <niels(dot)stevens(at)onesignal(dot)com>
Subject: Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1
Date: 2023-01-06 07:11:10
Message-ID: CAD21AoDXJd1Co9hC665CFUbj47_HGA0k4HdadOXGoPKyYK6ixQ@mail.gmail.com
Lists: pgsql-bugs

Hi,

On Tue, Jan 3, 2023 at 9:57 PM Alex Richman <alexrichman(at)onesignal(dot)com> wrote:
>
> Apologies for the delay (and happy Christmas/New Year).
>
> Please find included a full backtrace[1] of a sample of this crash, reproduced on PostgreSQL 15.1-1 in the same environment described in my original email. It is included as a gist due to its length, but let me know if it should be pasted in full for posterity. I've also added the Python script[2] used to reproduce it, if that's relevant.
>
> Unfortunately, we have not been able to reproduce this in a clean-room environment; however, we can note a few additional things:
> - This has occurred on multiple distinct servers with different data sets but similar write loads, which suggests it is not a specific server with data corruption.
> - Disabling pg_repack, autovacuum, and automatic reindexing has no effect; the bug can still occur.
> - Running the same script on a read-only logical replica does not hit the bug
> - As above, if the server is idle (no write traffic), then it does not hit the bug
> - The bug occurs in roughly 1 of every 10 executions of the replication slot creation, so it is not 100% consistent.
> - We're fairly confident that this did not occur pre 14.5-1, and started occurring in 14.6-1 & 15.1-1.
> So we would assume that there is some concurrent write traffic from our web tier that sometimes causes a segfault in the logical replication slot creation.
>
> Please let me know if you need any more information.

Thank you for providing more information.

One possibility is that you encountered the bug in snapbuild.c that is
already fixed by commit 898ef41bf6f4; the fix will be included in 14.7
and 15.2. I've attached patches with this fix for PG14 and PG15. Could
you please try the same scenario again with these patches and see if
the issue still happens?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment       Content-Type               Size
fix_pg15.patch   application/octet-stream   2.1 KB
fix_pg14.patch   application/octet-stream   2.1 KB
