RE: Fix 035_standby_logical_decoding.pl race conditions

From:	"Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To:	'Bertrand Drouvot' <bertranddrouvot(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	RE: Fix 035_standby_logical_decoding.pl race conditions
Date:	2025-03-21 12:28:10
Message-ID:	OSCPR01MB14966852B0E4CF07D42774695F5DB2@OSCPR01MB14966.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

Dear Bertrand,

I'm also working on the thread to resolve the random failure.

> Yes, that's also my understanding. It's also easy to "simulate" by adding
> a checkpoint on the primary and a long enough sleep after we launched our sql in
> wait_until_vacuum_can_remove().

Thanks for letting me know. For me, it could be reproduced with only the sleep().

> > So, if the above is correct, the reason for generating extra
> > xl_running_xacts on primary is Vacuum followed by Insert on primary
> > via below part of test:
> > $node_primary->safe_psql(
> > 'testdb', qq[VACUUM $vac_option verbose $to_vac;
> > INSERT INTO flush_wal DEFAULT VALUES;]);
>
> I'm not sure, I think a xl_running_xacts could also be generated (for example by
> the checkpointer) before the vacuum (should the system be slow enough).

I think you are right. When I added `CHECKPOINT` and a sleep after the user SQL,
I got the ordering below. This means that a RUNNING_XACTS record was generated
before the prune triggered by the vacuum.
```
...
lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766
...
lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765,...
...
```
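
For illustration only, here is a minimal sketch of how such an ordering could be
checked mechanically from record lines in the format shown above. The helper
name running_xacts_before_prune is hypothetical and is not part of the test or
of any PostgreSQL module; it only looks at the text after "desc:".

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for this sketch): returns 1 if a
# RUNNING_XACTS record appears before the first PRUNE_ON_ACCESS record
# in the given lines, and 0 otherwise.
sub running_xacts_before_prune
{
    my @lines = @_;
    my $seen_running_xacts = 0;

    foreach my $line (@lines)
    {
        # Only the record type after "desc:" matters here.
        next unless $line =~ /desc: (\w+)/;
        my $desc = $1;

        $seen_running_xacts = 1 if $desc eq 'RUNNING_XACTS';
        return $seen_running_xacts if $desc eq 'PRUNE_ON_ACCESS';
    }

    # No prune record was found at all.
    return 0;
}

my @sample = (
    'lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766',
    'lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765',
);

print running_xacts_before_prune(@sample)
  ? "RUNNING_XACTS precedes the prune\n"
  : "no RUNNING_XACTS before the prune\n";
```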

> I'm not sure, as I think a xl_running_xacts could still be generated after
> we execute "our sql" meaning:
>
> "
> $node_primary->safe_psql('testdb', qq[$sql]);
> "
>
> and before we launch the new DML. In that case I guess the issue could still
> happen.
>
> OTOH If we create the new DML "before" we launch "our sql" then the test
> would also fail for both active and inactive slots because that would not
> invalidate any slots.
>
> I did observe the above with the attached changes (just changing the PREPARE
> TRANSACTION location).

I've also tried the idea of keeping a transaction open via background_psql(),
but I got the same result. The test could still fail when a RUNNING_XACTS record
was generated before the transaction started.

> I agree, but I'm not sure it's doable as it looks to me that we should prevent
> the catalog xmin to advance past the conflict point while still
> generating a conflict point. Will try to give it another thought.

One primitive idea of mine was to stop the walsender/pg_recvlogical process for a while.
Sending SIGSTOP to pg_recvlogical might do the trick, but ISTM it would not work on Windows.
See 019_replslot_limit.pl.
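
As a rough sketch of the stop/continue mechanism only (not a patch for the
test), the snippet below pauses and resumes a child process with SIGSTOP and
SIGCONT. The sleeping child is a hypothetical stand-in for pg_recvlogical, and
the whole approach assumes a POSIX platform.

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

# Hypothetical stand-in for pg_recvlogical: a child that just sleeps.
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0)
{
    sleep 60;
    exit 0;
}

# Pause the child. WUNTRACED makes waitpid return once it is stopped.
kill('STOP', $pid) or die "could not send SIGSTOP: $!";
waitpid($pid, WUNTRACED);
my $stopped = WIFSTOPPED(${^CHILD_ERROR_NATIVE}) ? 1 : 0;
print "child stopped: $stopped\n";

# ... while the child is paused, the test could run the steps that must
# not be consumed yet ...

# Resume the child, then terminate and reap it to clean up.
kill('CONT', $pid) or die "could not send SIGCONT: $!";
kill('TERM', $pid) or die "could not send SIGTERM: $!";
waitpid($pid, 0);
```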

Best regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Re: Fix 035_standby_logical_decoding.pl race conditions at 2025-03-19 10:26:05 from Bertrand Drouvot

Responses

Re: Fix 035_standby_logical_decoding.pl race conditions at 2025-03-21 16:18:02 from Bertrand Drouvot