From: | hubert depesz lubaczewski <depesz(at)depesz(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date: | 2023-02-10 14:31:24 |
Message-ID: | Y+ZVPHHcYirQDgJF@depesz.com |
Lists: | pgsql-bugs |
Hi,
so, we have another bit of interesting information. maybe related, maybe
not.
We noticed a weird situation on two clusters we're trying to upgrade.
In both cases the situation looked the same:
1. there was another process (debezium) connected to the source (pg12)
using logical replication
2. pg12 -> pg14 replication failed with the message 'ERROR: requested
WAL segment ... has already been removed'
3. some time afterwards (most likely a couple of hours) the pg process
that is/was responsible for debezium replication stopped handling WAL,
and instead is eating 100% of cpu.
When this situation happens, pg_cancel_backend(pid) has no effect on the
"broken" wal sender, and neither does pg_terminate_backend()!
strace of the process doesn't show anything.
When I tried to get backtrace from gdb all I got was:
(gdb) bt
#0 0x0000aaaad270521c in hash_seq_search ()
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4 0x0000aaaad257764c in ReorderBufferCommit ()
#5 0x0000aaaad256c804 in ?? ()
#6 0x0000aaaaf303d280 in ?? ()
If I quit gdb, restart it, and redo bt, I get
#0 0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad291ae58 in ?? ()
or
#0 0x0000aaaad2705244 in hash_seq_search ()
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4 0x0000aaaad257764c in ReorderBufferCommit ()
#5 0x0000aaaad256c804 in ?? ()
#6 0x0000aaaaf303d280 in ?? ()
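A minimal sketch of how such repeated samples can be compared, assuming each backtrace is saved as plain text in the form shown above. The helper names frame_symbols and common_symbols are hypothetical, not part of gdb or PostgreSQL. It only reports which resolved symbols appear in every sample, which hints at where the process may be spinning:

```python
import re

# Matches gdb frame lines of the form shown above, e.g.
#   "#0 0x0000aaaad270521c in hash_seq_search ()"
# Assumption: every frame carries an address; frames printed without one
# are simply ignored.
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\S+)")

# PLT stubs show up as "name@plt"; the list archive renders "@" as "(at)".
SUFFIX_RE = re.compile(r"@|\(at\)")


def frame_symbols(backtrace: str) -> list[str]:
    """Return the resolved symbol names of one backtrace, innermost first."""
    symbols = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None or match.group(2) == "??":
            continue  # prompt lines, blank lines, unresolved frames
        symbols.append(SUFFIX_RE.split(match.group(2))[0])
    return symbols


def common_symbols(backtraces: list[str]) -> set[str]:
    """Return the symbols that appear in every sampled backtrace."""
    per_sample = [set(frame_symbols(bt)) for bt in backtraces]
    return set.intersection(*per_sample) if per_sample else set()
```

A symbol that shows up in every sample (hash_seq_search in the three above) is a candidate for where the loop sits, but a handful of samples does not prove it.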
At this moment, the only thing that we can do is kill -9 the process (or
restart pg).
I don't know if it's relevant, but I have this case *right now*, and if
it's helpful I can provide more information before we have to kill it.
Best regards,
depesz