Quick Links

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From:	Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org, exclusion(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject:	Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Date:	2024-01-18 11:29:39
Message-ID:	CALj2ACUo_rDTonZCkqdSnsc3tT5_cFcJQHSQrsyAYyO5MLO52A@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Jan 15, 2024 at 1:18 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> While working on [1], we discovered (thanks Alexander for the testing) that an
> conflicting active logical slot on a standby could be "terminated" without
> leading to an "obsolete" message (see [2]).
>
> Indeed, in case of an active slot we proceed in 2 steps in
> InvalidatePossiblyObsoleteSlot():
>
> - terminate the backend holding the slot
> - report the slot as obsolete
>
> This is racy because between the two we release the mutex on the slot, which
> means that the slot's effective_xmin and effective_catalog_xmin could advance
> during that time (leading to exit the loop).
>
> I think that holding the mutex longer is not an option (given what we'd to do
> while holding it) so the attached proposal is to record the effective_xmin and
> effective_catalog_xmin instead that was used during the backend termination.
>
> [1]: https://www.postgresql.org/message-id/flat/bf67e076-b163-9ba3-4ade-b9fc51a3a8f6%40gmail.com
> [2]: https://www.postgresql.org/message-id/7c025095-5763-17a6-8c80-899b35bd0459%40gmail.com
>
> Looking forward to your feedback,

IIUC, the issue is that we terminate the process holding the
replication slot, and the conflict cause that we recorded before
terminating the process changes in the next iteration due to the
advancement in effective_xmin and/or effective_catalog_xmin.

FWIW, a test code something like [1], can help detect above race issues, right?

Some comments on the patch:

1.
last_signaled_pid = active_pid;
+ terminated = true;
}

Why is a separate variable needed? Can't last_signaled_pid != 0 enough
to determine that a process was terminated earlier?

2. If my understanding of the racy behavior is right, can the same
issue happen due to the advancement in restart_lsn? I'm not sure if it
can happen at all, but I think we can rely on previous conflict reason
instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.

3. Is there a way to reproduce this racy issue, may be by adding some
sleeps or something? If yes, can you share it here, just for the
records and verify the whatever fix provided actually works?

[1]
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 52da694c79..d020b038bc 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1352,6 +1352,7 @@
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
{
int last_signaled_pid = 0;
bool released_lock = false;
+ ReplicationSlotInvalidationCause conflict_prev = RS_INVAL_NONE;

for (;;)
{
@@ -1417,6 +1418,18 @@
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
}
}

+ /*
+ * Check if the conflict cause recorded previously
before we terminate
+ * the process changed now for any reason.
+ */
+ if (conflict_prev != RS_INVAL_NONE &&
+ last_signaled_pid != 0 &&
+ conflict_prev != conflict)
+ elog(PANIC, "conflict cause recorded before
terminating process %d has been changed; previous cause: %d, current
cause: %d",
+ last_signaled_pid,
+ conflict_prev,
+ conflict);
+
/* if there's no conflict, we're done */
if (conflict == RS_INVAL_NONE)
{
@@ -1499,6 +1512,7 @@
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
(void) kill(active_pid, SIGTERM);

last_signaled_pid = active_pid;
+ conflict_prev = conflict;
}

/* Wait until the slot is released. */

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Fix race condition in InvalidatePossiblyObsoleteSlot() at 2024-01-15 07:48:43 from Bertrand Drouvot

Responses

Re: Fix race condition in InvalidatePossiblyObsoleteSlot() at 2024-01-18 14:20:28 from Bertrand Drouvot

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alvaro Herrera	2024-01-18 11:46:24	Re: remaining sql/json patches
Previous Message	Ashutosh Bapat	2024-01-18 11:28:49	Re: Report planning memory in EXPLAIN ANALYZE