Re: Backend handling replication slot stuck using 100% cpu, unkillable

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: depesz(at)depesz(dot)com
Cc: pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Backend handling replication slot stuck using 100% cpu, unkillable
Date: 2023-07-04 13:04:58
Message-ID: d25141d4-c390-b74b-53c1-97064d07b7c9@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 7/4/23 13:53, hubert depesz lubaczewski wrote:
> On Tue, Jul 04, 2023 at 01:30:21PM +0200, Tomas Vondra wrote:
>> On 7/3/23 14:58, hubert depesz lubaczewski wrote:
>> So is it an infinite loop in ReorderBufferExecuteInvalidations, or is it
>> just the case that there are many invalidations? I can't really deduce
>> that from the backtraces.
>>
>> How many invalidations does the transaction have? Should be enough to
>>
>> print txn->ninvalidations
>>
>> Also, is there anything interesting about the transaction? You know the
>> XID (2741814901) so maybe use pg_waldump to see what it did.
>
> I don't have the process anymore. Pg was restarted, and the app was
> forced to not take data from this place.
>

OK. So the slot is still there, and we could try reading the data again
(hopefully getting stuck in the same place).

> I can run pg_waldump, no problem, but I'm not an expert on this. Which
> wal file(s) should I try to dump? and what to look for (or should I just
> provide full output of dump?
>

The backtrace has this:

and 187650155969544 should be LSN AAAA/B4E37C08, which maps to

select pg_walfile_name('AAAA/B4E37C08');
pg_walfile_name
--------------------------
000000010000AAAA000000B4

So maybe try dumping that WAL segment and see if the XID 2741814901 is
there. If yes, maybe share the whole dump, it's difficult to say what to
share without knowing what the transaction did.

There's also the LSNs in pg_replication_slots - it might be interesting
to see dump for all the WAL between restart_lsn and confirmed_flush_lsn.
So like this segments between:

SELECT pg_walfile_name(restart_lsn),
pg_walfile_name(confirmed_flush_lsn)
FROM pg_replication_slots WHERE slot_name = 'slot_name';

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Vamshikrishna T 2023-07-04 13:48:21 Re: BUG #18009: Postgres Recovery not happening
Previous Message hubert depesz lubaczewski 2023-07-04 11:53:43 Re: Backend handling replication slot stuck using 100% cpu, unkillable