Re: Fix 035_standby_logical_decoding.pl race conditions

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Fix 035_standby_logical_decoding.pl race conditions
Date: 2025-03-19 10:26:05
Message-ID: Z9qbvRt1ghOPvS1/@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Wed, Mar 19, 2025 at 12:12:19PM +0530, Amit Kapila wrote:
> On Mon, Feb 10, 2025 at 8:12 PM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >
> > Please find attached a patch to $SUBJECT.
> >
> > In rare circumstances (and on slow machines) it is possible that a xl_running_xacts
> > is emitted and that the catalog_xmin of a logical slot on the standby advances
> > past the conflict point. In that case, no conflict is reported and the test
> > fails. It has been observed several times and the last discussion can be found
> > in [1].
> >
>

Thanks for looking at it!

> Is my understanding correct that bgwriter on primary node has created
> a xl_running_xacts, then that record is replicated to standby, and
> while decoding it (xl_running_xacts) on standby via active_slot, we
> advanced the catalog_xmin of active_slot? If this happens then the
> replay of vacuum record on standby won't be able to invalidate the
> active slot, right?

Yes, that's also my understanding. It's also easy to "simulate": add a checkpoint
on the primary and a long enough sleep right after we launch our sql in
wait_until_vacuum_can_remove().
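
To make the sequence above concrete, here is a toy model of the race. It is
only a sketch, not PostgreSQL code: the Slot class, the function names and the
xid values are all made up for illustration, xid wraparound is ignored, and the
slot's catalog_xmin is assumed to move as soon as the xl_running_xacts record
is decoded.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical stand-in for a logical slot on the standby."""
    catalog_xmin: int
    invalidated: bool = False


def decode_running_xacts(slot: Slot, oldest_running_xid: int) -> None:
    # Assumption: decoding an xl_running_xacts record lets catalog_xmin
    # move forward (never backward) to the oldest xid still running.
    slot.catalog_xmin = max(slot.catalog_xmin, oldest_running_xid)


def replay_vacuum(slot: Slot, conflict_horizon: int) -> None:
    # The slot conflicts only if it may still need catalog rows at or
    # below the horizon carried by the vacuum record.
    if slot.catalog_xmin <= conflict_horizon:
        slot.invalidated = True


def run(extra_running_xacts: bool) -> bool:
    """Return True if the slot ends up invalidated (what the test expects)."""
    slot = Slot(catalog_xmin=100)
    if extra_running_xacts:
        # The record sneaks in before the vacuum record is replayed.
        decode_running_xacts(slot, oldest_running_xid=106)
    replay_vacuum(slot, conflict_horizon=105)
    return slot.invalidated
```

In this model run(False) reports the conflict the test waits for, while
run(True) leaves the slot valid, which is the failure mode discussed here.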

> So, if the above is correct, the reason for generating extra
> xl_running_xacts on primary is Vacuum followed by Insert on primary
> via below part of test:
> $node_primary->safe_psql(
> 'testdb', qq[VACUUM $vac_option verbose $to_vac;
> INSERT INTO flush_wal DEFAULT VALUES;]);

I'm not sure: I think a xl_running_xacts could also be generated (for example by
the checkpointer) before the vacuum, should the system be slow enough.

> > Remarks:
> >
> > R1. The issue still remains in v16 though (as injection points are available since
> > v17).
> >
>
> This is not an ideal case because the test would still keep failing
> intermittently on 16.

I do agree.

> I am wondering what if we start a transaction
> before vacuum and do some DML in it but didn't commit that xact till
> the active_slot test is finished then even the extra logging of
> xl_running_xacts shouldn't advance xmin during decoding.

I'm not sure, as I think a xl_running_xacts could still be generated after
we execute "our sql" meaning:

"
$node_primary->safe_psql('testdb', qq[$sql]);
"

and before we launch the new DML. In that case I guess the issue could still
happen.

OTOH, if we create the new DML "before" we launch "our sql", then the test
would also fail for both active and inactive slots, because that would not
invalidate any slots.

I did observe the above with the attached changes (just changing the PREPARE
TRANSACTION location).

> we should try to find some solution which could be
> backpatched to 16 as well.

I agree, but I'm not sure it's doable, as it looks to me that we would have to
prevent the catalog xmin from advancing past the conflict point while still
generating a conflict point. I'll give it another thought.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
test_prepared_txn.txt text/plain 3.3 KB
