Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2023-02-10 14:31:24
Message-ID: Y+ZVPHHcYirQDgJF@depesz.com
Lists: pgsql-bugs

Hi,
so, we have another bit of interesting information. maybe related, maybe
not.

We noticed a weird situation on two clusters we're trying to upgrade.

In both cases the situation looked the same:

1. there was another process (debezium) connected to source (pg12) using
logical replication
2. pg12 -> pg14 replication failed with the message 'ERROR: requested
WAL segment ... has already been removed'
3. some time afterwards (most likely a couple of hours) the pg process
that is/was responsible for debezium replication stopped handling WAL,
and is instead eating 100% of cpu.

When this situation happens, pg_cancel_backend(pid) has no effect on the
"broken" wal sender, and pg_terminate_backend(pid) doesn't work either!
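
For reference, a minimal sketch of the kind of attempt described above. It is not the exact commands that were run. It assumes a DB-API cursor using the %s parameter style (for example from psycopg2) connected with sufficient privileges; the function name try_stop_walsender is hypothetical. Both SQL functions return true once the signal has been sent, which does not mean the backend acted on it.

```python
def try_stop_walsender(cur, pid):
    """Send cancel, then terminate, to backend `pid`; return both results.

    `cur` is any DB-API cursor using the %s parameter style (an assumption).
    """
    cur.execute("SELECT pg_cancel_backend(%s)", (pid,))
    cancelled = cur.fetchone()[0]   # True only means SIGINT was sent
    cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
    terminated = cur.fetchone()[0]  # True only means SIGTERM was sent
    return cancelled, terminated
```

A result of (True, True) with the process still running would match what is described here: the signals are delivered, but the wal sender never reacts to them.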

strace of the process doesn't show anything.

When I tried to get backtrace from gdb all I got was:

(gdb) bt
#0 0x0000aaaad270521c in hash_seq_search ()
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4 0x0000aaaad257764c in ReorderBufferCommit ()
#5 0x0000aaaad256c804 in ?? ()
#6 0x0000aaaaf303d280 in ?? ()

If I quit gdb, restart it, and redo bt, I get:

#0 0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad291ae58 in ?? ()

or

#0 0x0000aaaad2705244 in hash_seq_search ()
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4 0x0000aaaad257764c in ReorderBufferCommit ()
#5 0x0000aaaad256c804 in ?? ()
#6 0x0000aaaaf303d280 in ?? ()
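
To compare samples like these, here is a minimal sketch (not part of the original debugging session) that pulls the function name out of each gdb frame line, so that repeated backtraces can be tallied by innermost frame. The helper names parse_frames and innermost_counts are hypothetical, and the regex assumes only the "#N 0xADDR in func ()" layout shown above.

```python
import re
from collections import Counter

# Matches gdb frame lines such as:
#   #0 0x0000aaaad270521c in hash_seq_search ()
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s+\(")

def parse_frames(backtrace):
    """Return the function names of one backtrace, innermost frame first."""
    frames = []
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append(m.group(2))  # "??" when gdb has no symbol
    return frames

def innermost_counts(backtraces):
    """Count how often each function is the innermost frame across samples."""
    counts = Counter()
    for bt in backtraces:
        frames = parse_frames(bt)
        if frames:
            counts[frames[0]] += 1
    return counts
```

In all three samples above the innermost frame is hash_seq_search (once via its PLT stub), called from pgoutput.so.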

At this moment, the only thing that we can do is kill -9 the process (or
restart pg).

I don't know if it's relevant, but I have this case *right now*, and if
it's helpful I can provide more information before we have to kill it.

Best regards,

depesz
