Re: ReplicationSlotRelease may set the statusFlags of other processes in PG14

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: feichanghong <feichanghong(at)qq(dot)com>
Cc: pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>, andres <andres(at)anarazel(dot)de>, "sawada(dot)mshk" <sawada(dot)mshk(at)gmail(dot)com>, "horikyota(dot)ntt" <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: ReplicationSlotRelease may set the statusFlags of other processes in PG14
Date: 2024-03-19 03:57:51
Message-ID: ZfkNP1OdgBSPPTsR@paquier.xyz
Lists: pgsql-bugs

On Sat, Mar 16, 2024 at 10:29:03PM +0800, feichanghong wrote:
> A process utilizing replication slots (usually walsender) calls callback
> functions in the order of RemoveProcFromArray->ProcKill upon abnormal exit.
> Within RemoveProcFromArray, MyProc is already removed from the ProcArray.
> ProcKill then attempts to set ProcGlobal->statusFlags[MyProc->pgxactoff] again
> via ReplicationSlotRelease. By this time, the flag may already be assigned to
> another process.

Oops.
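
To make the hazard described above concrete, here is a minimal, self-contained model of it. This is not PostgreSQL code: the `Sim*` and `sim_*` names, the array size, and the flag value are all hypothetical, and the model only assumes a dense flags array that is compacted on removal while the removed entry keeps its old offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_MAX_PROCS 4
#define SIM_FLAG 0x10           /* hypothetical stand-in for a status flag */

/* Hypothetical stand-in for a PGPROC entry. */
typedef struct SimProc
{
    int         pgxactoff;      /* offset into the dense arrays below */
} SimProc;

static SimProc *sim_procs[SIM_MAX_PROCS];   /* dense list of live entries */
static uint8_t  sim_flags[SIM_MAX_PROCS];   /* flags, indexed by pgxactoff */
static int      sim_count = 0;

/* Append an entry at the end of the dense arrays. */
static void
sim_add(SimProc *p)
{
    assert(sim_count < SIM_MAX_PROCS);
    p->pgxactoff = sim_count;
    sim_procs[sim_count] = p;
    sim_flags[sim_count] = 0;
    sim_count++;
}

/*
 * Remove an entry and shift later entries down.  The removed entry's own
 * pgxactoff is deliberately left untouched, i.e. stale.
 */
static void
sim_remove(SimProc *p)
{
    int         off = p->pgxactoff;

    for (int i = off; i < sim_count - 1; i++)
    {
        sim_procs[i] = sim_procs[i + 1];
        sim_flags[i] = sim_flags[i + 1];
        sim_procs[i]->pgxactoff = i;
    }
    sim_count--;
    sim_procs[sim_count] = NULL;
    sim_flags[sim_count] = 0;
}

/* Returns the flags of the surviving entry after a stale write. */
static uint8_t
sim_stale_write_demo(void)
{
    static SimProc walsender;
    static SimProc other;

    sim_count = 0;
    sim_add(&walsender);        /* gets offset 0 */
    sim_add(&other);            /* gets offset 1 */

    sim_remove(&walsender);     /* "other" now lives at offset 0 */

    /* Late write through the stale offset, as in the reported sequence. */
    sim_flags[walsender.pgxactoff] |= SIM_FLAG;

    return sim_flags[other.pgxactoff];
}
```

In this model, `sim_stale_write_demo()` returns `SIM_FLAG`: the write made on behalf of the removed entry ends up in the slot now owned by the other one.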

> To replicate the issue, execute the following steps:
> 1. Apply the attached v1-0000-v14-invalidate-pgxactoff-after-remove-pgproc.patch,
> where pgxactoff is set to an invalid value in ProcArrayRemove, and some
> checks are added.
> 2. Use the SQL below to terminate the walsender process.
> ```
> select pg_terminate_backend(pid) from pg_stat_activity where backend_type = 'walsender';
> ```
>
> # Fix
>
> To fix the issue, I have provided some patches in the attachment:
> 1. Backpatching 2f6501f into the PG14 version will fix the problem.
> 2. In PG14-head, ProcArrayRemove needs to reset pgxactoff, and some assert
> checks should be done when setting ProcGlobal->statusFlags.

Yeah, that's something that we had better fix in all stable branches.
The asserts would offer some protection going forward, but I would take
the safer move of only adding a protection like what you are
suggesting on HEAD and not in stable branches, just in case we're
missing something around them.
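
For illustration only, the kind of protection discussed above (resetting the offset on removal and checking it before touching the flags) could look roughly like this in a standalone model. This is a sketch under assumptions, not the proposed patch: every `Guard*` and `guard_*` name is hypothetical, and the checked setter returns false where the real code would presumably use an assertion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_MAX_PROCS 4
#define GUARD_INVALID_OFFSET (-1)   /* hypothetical "not in the array" marker */

/* Hypothetical stand-in for a PGPROC entry. */
typedef struct GuardProc
{
    int         pgxactoff;      /* offset into the dense arrays, or invalid */
} GuardProc;

static GuardProc *guard_procs[GUARD_MAX_PROCS];
static uint8_t  guard_flags[GUARD_MAX_PROCS];
static int      guard_count = 0;

/* Append an entry at the end of the dense arrays. */
static void
guard_add(GuardProc *p)
{
    assert(guard_count < GUARD_MAX_PROCS);
    p->pgxactoff = guard_count;
    guard_procs[guard_count] = p;
    guard_flags[guard_count] = 0;
    guard_count++;
}

/* Remove an entry, compact the arrays, and invalidate the stale offset. */
static void
guard_remove(GuardProc *p)
{
    int         off = p->pgxactoff;

    assert(off >= 0 && off < guard_count && guard_procs[off] == p);
    for (int i = off; i < guard_count - 1; i++)
    {
        guard_procs[i] = guard_procs[i + 1];
        guard_flags[i] = guard_flags[i + 1];
        guard_procs[i]->pgxactoff = i;
    }
    guard_count--;
    guard_procs[guard_count] = NULL;
    guard_flags[guard_count] = 0;

    p->pgxactoff = GUARD_INVALID_OFFSET;    /* the reset step */
}

/*
 * Set a flag only if the entry still owns a valid slot.  Returns false
 * instead of writing through an invalid or foreign offset.
 */
static bool
guard_set_flag(GuardProc *p, uint8_t flag)
{
    int         off = p->pgxactoff;

    if (off < 0 || off >= guard_count || guard_procs[off] != p)
        return false;
    guard_flags[off] |= flag;
    return true;
}
```

With the offset invalidated at removal, a late `guard_set_flag()` call on a removed entry is rejected in this model rather than landing in another entry's slot.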
--
Michael
