Re: Fix 035_standby_logical_decoding.pl race conditions

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Fix 035_standby_logical_decoding.pl race conditions
Date: 2025-04-02 08:36:46
Message-ID: Z+z3Hs765Ixo0mN7@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Kuroda-san,

On Wed, Apr 02, 2025 at 07:16:25AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Dear Amit, Bertrand,
>
> > You have not added any injection point for the above case. Isn't it
> > possible that if running_xact record is logged concurrently to the
> > pruning record, it should move the active slot on standby, and the
> > same failure should occur in this case as well?
>
> I considered that the timing failure can happen. Reproducer:
>
> ```
> $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
> +$node_primary->safe_psql('testdb', 'CHECKPOINT');
> +sleep(20);
> $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
> ```

Yeah, I was going to provide the exact same reproducer and then saw your email.

> Based on the fact, I've updated to use injection_points for scenario 5. Of course,
> PG16/17 patches won't use the active slot for that scenario.

Thanks for the updated patch!

As far v4-0001:

=== 1

+# would advance an active replication slot's catalog_xmin

s/would/could/? I mean the system also needs to be "slow" enough (so the
sleep() in the reproducer)

=== 2

+# Injection_point is used to avoid seeing an xl_running_xacts even here. In
+# scenario 5, we verify the case that the backend process detects the page has
+# enough tuples; thus, page pruning happens. If the record is generated just
+# before doing on-pruning, the catalog_xmin of the active slot would be
+# updated; hence, the conflict would not occur.

Not sure we need to explain what scenario 5 does here, but that does not hurt
if you feel the need.

Also maybe mention the last update in the comment and add some nuance (like
proposed in === 1), something like?

"
# Injection_point is used to avoid seeing a xl_running_xacts here. Indeed,
# if it is generated between the last 2 updates then the catalog_xmin of the active
# slot could be updated; hence, the conflict could not occur.
"

Apart from that the tests looks good to me and all the problematic scenarios
covered.

As far PG17-v4-0001:

=== 3

-# seeing a xl_running_xacts that would advance an active replication slot's
+# seeing the xl_running_xacts that would advance an active replication slot's

why?

=== 4

It looks like that check_slots_conflict_reason() is not called with checks_active_slot
as argument.

=== 5

I think that we could remove the need for the drop_active_slot parameter in
drop_logical_slots() and just check if an active slot exists (and if so drop
it). That said I'm not sure it's worth to go that far for backpatching.

As far PG16-v4:

=== 6

Same as === 3 and === 5 (=== 4 does not apply as check_slots_conflict_reason()
does not exist).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2025-04-02 08:45:22 Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints
Previous Message torikoshia 2025-04-02 08:33:27 Re: Change log level for notifying hot standby is waiting non-overflowed snapshot