Re: Undetected deadlock between client backend and startup processes on a standby (Previously, Undetected deadlock between primary and standby processes)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Rintaro(dot)Ikeda(at)nttdata(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Undetected deadlock between client backend and startup processes on a standby (Previously, Undetected deadlock between primary and standby processes)
Date: 2024-03-10 20:43:11
Message-ID: ea96bc84-e242-4179-a440-9d4b8a7bae9f@enterprisedb.com
Lists: pgsql-bugs

On 3/4/24 09:35, Rintaro(dot)Ikeda(at)nttdata(dot)com wrote:
> Hi,
>
> I correct the previous bug report [1] to provide a more accurate
> description. The bug report demonstrated undetected deadlock between
> client backend and startup processes on a standby server. (The title
> in the previous bug report is "Undetected deadlock between primary
> and standby processes". But this was wrong. Actually, this should be
> noted that "Undetected deadlock between client backend and startup
> process on a standby server".)
>
> After the procedures proposed in my bug report [1], a recovery
> conflict is present because the tablespace which startup process
> tries to drop is used by client backend process in standby. We see
> the pg_stat_activity (shown below), which implies a deadlock. A
> client backend process waits for AccessExclusiveLock to be released.
> Startup process waits for recovery conflict resolution for dropping
> the tablespace. This deadlock is not resolved after deadlock_timeout
> passes.
>
> (Standby server)
> postgres=# select datid, datname, wait_event_type, wait_event, query, backend_type from pg_stat_activity ;
>  datid | datname  | wait_event_type |         wait_event         |      query       |  backend_type
> -------+----------+-----------------+----------------------------+------------------+----------------
>      5 | postgres | Lock            | relation                   | SELECT * FROM t; | client backend
>        |          | IPC             | RecoveryConflictTablespace |                  | startup
>
>
> This deadlock is similar to the previously identified and patched
> issue [2], which also involved an undetected deadlock between
> backend process and recovery on a standby server. I think the
> deadlock explained in this report should be detected and resolved.
>

Thanks for the report.

So what are the steps to reproduce this? The previous message did all
kinds of stuff on the primary and then got stuck on pg_switch_wal() on
the primary, but this updated report seems to do stuff on the standby
and gets the lockup there.

This seems similar to [2] in the sense that it's about interaction
between recovery and a regular backend, but unfortunately
ResolveRecoveryConflictWithVirtualXIDs does not wait for a lock. It just
checks if the XID is still running, so the wait is invisible to the
deadlock detector :-(

But the wait is still checked against max_standby_streaming_delay, which
should resolve the deadlock at some point (unless it is set to -1 to
allow infinite delays), right?
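
To make the polling point concrete, here is a toy Python model. It is
not the PostgreSQL code, and every name in it is made up for
illustration. It assumes the wait is nothing more than a repeated
"still running?" check against a deadline. No lock is requested, so
there is nothing for a deadlock detector to see, and the only way out
is the timeout (None stands in for a setting of -1).

```python
import time


def resolve_conflict_by_polling(is_still_running, cancel, max_delay_s,
                                poll_interval_s=0.01):
    """Toy model of a polling wait with a deadline (hypothetical names).

    is_still_running: callable returning True while the conflicting
                      transaction is active.
    cancel:           callable that terminates the conflicting transaction.
    max_delay_s:      delay limit in seconds; None means wait forever,
                      standing in for a setting of -1.
    Returns "finished" or "cancelled".
    """
    deadline = None if max_delay_s is None else time.monotonic() + max_delay_s
    while is_still_running():
        # No lock is taken here, only a status check, so a lock-graph
        # based deadlock detector has nothing to inspect.
        if deadline is not None and time.monotonic() >= deadline:
            cancel()
            return "cancelled"
        time.sleep(poll_interval_s)
    return "finished"


# A backend that never finishes on its own because it is itself blocked.
state = {"running": True}
outcome = resolve_conflict_by_polling(
    lambda: state["running"],
    lambda: state.update(running=False),
    max_delay_s=0.05,
)
```

In this model the blocked backend is cancelled once the limit passes;
with max_delay_s=None the same call would loop forever, which matches
the "unless set to -1" caveat above.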

Also, I'm not very familiar with ResolveRecoveryConflictWithVirtualXIDs,
but it seems it's doing a busy wait. I wonder if that's a good idea, but
it's independent of this bug report.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
