Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, exclusion(at)gmail(dot)com
Subject: Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Date: 2024-03-05 10:17:03
Message-ID: ZebxH6YJ8mfB5rJ6@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Tue, Mar 05, 2024 at 09:42:20AM +0900, Michael Paquier wrote:
> On Mon, Feb 26, 2024 at 02:01:45PM +0000, Bertrand Drouvot wrote:
> > Though [1] mentioned up-thread is not pushed yet, I'm Sharing the POC patch now
> > (see the attached).
>
> I have looked at what you have here.

Thanks!

> First, in a build where 818fefd8fd is included, this makes the test
> script a lot slower. Most of the logic is quick, but we're spending
> 10s or so checking that catalog_xmin has advanced. Could it be
> possible to make that faster?

Yeah, v2 attached changes this. It moves the injection point after the process
has been killed so that another process can decode at wish (without the need
to wait for a walsender timeout) to reach LogicalConfirmReceivedLocation().

> A second issue is the failure mode when 818fefd8fd is reverted. The
> test is getting stuck when we are waiting on the standby to catch up,
> until a timeout decides to kick in to fail the test, and all the
> previous tests pass. Could it be possible to make that more
> responsive? I assume that in the failure mode we would get an
> incorrect conflict_reason for injection_inactiveslot, succeeding in
> checking the failure.

I try to simulate a revert of 818fefd8fd (replacing "!terminated" by "1 == 1"
before the initial_* assignements). The issue is that then the new ASSERT is
triggered leading to the standby shutdown. So, I'm not sure how to improve this
case.

> + my $terminated = 0;
> + for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
> + {
> + if ($node_standby->log_contains(
> + 'terminating process .* to release replication slot \"injection_activeslot\"', $logstart))
> + {
> + $terminated = 1;
> + last;
> + }
> + usleep(100_000);
> + }
> + ok($terminated, 'terminating process holding the active slot is logged with injection point');
>
> The LOG exists when we are sure that the startup process is waiting
> in the injection point, so this loop could be replaced with something
> like:
> + $node_standby->wait_for_event('startup', 'TerminateProcessHoldingSlot');
> + ok( $node_standby->log_contains('terminating process .* .. ', 'termin .. ';)
>

Yeah, now that wait_for_event() is there, let's use it: done in v2.

> Nit: the name of the injection point should be
> terminate-process-holding-slot rather than
> TerminateProcessHoldingSlot, to be consistent with the other ones.

Done in v2.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v2-0001-Add-regression-test-for-818fefd8fd.patch text/x-diff 6.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2024-03-05 10:30:16 Re: Change GUC hashtable to use simplehash?
Previous Message zwj 2024-03-05 10:04:33 回复: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)