Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

From: Alex Richman <alexrichman(at)onesignal(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Niels Stevens <niels(dot)stevens(at)onesignal(dot)com>
Subject: Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1
Date: 2023-01-03 12:56:33
Message-ID: CAMnUB3oJ5rjxxrtB6vc15RdyqDeVtsfMn1aEehzWATw56VNtow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Apologies for the delay (and happy christmas/new years).

Please find included a full backtrace[1] of a sample of this crash,
replicated on postgres 15.1-1 in the same environment described in my
original email. Included as a gist due to the length but lmk if it should
be pasted in full for posterity. I've also added the python script[2] used
to replicate, if that's relevant.

Unfortunately we have not been able to reproduce this in a clean room
environment, however we can note a few additional things:
- This has occurred over multiple distinct servers with different data
sets, though similar write loads. Suggesting it's not a specific server
with data corruption.
- Disabling pg_repack, autovacuum, automatic reindexing, has no effect, the
bug can still occur
- Running the same script on a read-only logical replica does not hit the
bug
- As above, if the server is idle (no write traffic), then it does not hit
the bug
- The bug occurs roughly 1 in every 10 executions of the create replication
slot, so is not 100% consistent.
- We're fairly confident that this did not occur pre 14.5-1, and started
occurring in 14.6-1 & 15.1-1.
So we would assume that there is some concurrent write traffic from our web
tier that sometimes causes a segfault in the logical replication slot
creation.

Please let me know if you need any more information.

Thanks,
- Alex.

[1]
https://gist.github.com/alex-richman-onesignal/0ba626ca4de00d24c331b4b224f0b399
[2]
https://gist.github.com/alex-richman-onesignal/d1e51f2d4e0ae9da80f00457655b2eb8

On Thu, 15 Dec 2022 at 03:06, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Thu, Dec 15, 2022 at 4:35 AM Alex Richman <alexrichman(at)onesignal(dot)com>
> wrote:
> >
> > We've noticed a segfault bug while creating logical replication slots.
> We've seen this on 14.6-1 and 15.1-1 (both installed from postgres
> apt-archive repos to debian 11 hosts). We're confident that this behaviour
> is not present in 14.5-1.
> >
> > We can only reproduce this if the server is under load (reads/writes
> from clients, perhaps also pg_repacks, reindexes etc). If the server is
> idle then we cannot reproduce it.
> >
> > Dmesg logs:
> > 211935.284834] postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp
> 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]
> >
> > Postgres logs:
> > 2022-12-14 22:33:58 UTC 2297089STATEMENT: SELECT
> pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput',
> false);
> > 2022-12-14 22:33:58 UTC 778310LOG: server process (PID 2297089) was
> terminated by signal 11: Segmentation fault
> > 2022-12-14 22:33:58 UTC 778310DETAIL: Failed process was running:
> SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28',
> 'pgoutput', false);
> > 2022-12-14 22:33:58 UTC 778310LOG: terminating any other active server
> processes
> >
> > Journal logs:
> > Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel:
> postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068
> error 4 in postgres[559079e4d000+51a000]
> > Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel:
> Code: 48 83 c4 08 49 89 c0 5b 41 5c 4c 89 c0 41 5d 5d c3 66 90 83 e3 02 75
> b8 e9 71 18 b9 ff e9 d3 18 b9 ff 90 48 89 fe 48 8b 7f f8 <48> 8b 47 >
> >
> > Let me know if you need any more info.
> >
>
> It is difficult to diagnose this without more information because
> there is no clear call stack or a reproducible scenario. Will it be
> possible for you to get a reproducible test case or at least call
> stack to proceed?
>
> --
> With Regards,
> Amit Kapila.
>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2023-01-04 02:48:28 BUG #17736: when psql -c is used, the $ sign is escaped
Previous Message Amit Kapila 2023-01-03 03:55:29 Re: BUG #17716: walsender process hang while decoding 'DROP PUBLICATION' XLOG