RE: Fix 035_standby_logical_decoding.pl race conditions

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Bertrand Drouvot' <bertranddrouvot(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Fix 035_standby_logical_decoding.pl race conditions
Date: 2025-03-21 12:28:10
Message-ID: OSCPR01MB14966852B0E4CF07D42774695F5DB2@OSCPR01MB14966.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear Bertrand,

I'm also working on the thread to resolve the random failure.

> Yes, that's also my understanding. It's also easy to "simulate" by adding
> a checkpoint on the primary and a long enough sleep after we launched our sql in
> wait_until_vacuum_can_remove().

Thanks for letting me know. For me, it could be reproduced with only the sleep().

> > So, if the above is correct, the reason for generating extra
> > xl_running_xacts on primary is Vacuum followed by Insert on primary
> > via below part of test:
> > $node_primary->safe_psql(
> > 'testdb', qq[VACUUM $vac_option verbose $to_vac;
> > INSERT INTO flush_wal DEFAULT VALUES;]);
>
> I'm not sure, I think a xl_running_xacts could also be generated (for example by
> the checkpointer) before the vacuum (should the system be slow enough).

I think you are right. When I added `CHECKPOINT` and a sleep after the user SQLs,
I got the ordering below. This means that a RUNNING_XACTS record was generated
before the prune triggered by the vacuum.
```
...
lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766
...
lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765,...
...
```
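
For illustration only, here is a minimal sketch (in Python, not part of the TAP test) of how this ordering could be checked mechanically from pg_waldump-style output. The helper names `parse_lsn`, `first_lsn` and `running_xacts_precedes_prune` are hypothetical, and the sample holds just the two records quoted above.

```python
import re

# Matches the "lsn: X/Y, ... desc: RECORD_TYPE" part of a pg_waldump-style line.
RECORD_RE = re.compile(r"lsn: (?P<lsn>[0-9A-Fa-f]+/[0-9A-Fa-f]+),.*?desc: (?P<desc>\w+)")


def parse_lsn(text):
    """Convert an LSN such as '0/04025218' into a comparable integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def first_lsn(lines, record_type):
    """Return the LSN of the first record of the given type, or None."""
    for line in lines:
        m = RECORD_RE.search(line)
        if m and m.group("desc") == record_type:
            return parse_lsn(m.group("lsn"))
    return None


def running_xacts_precedes_prune(lines):
    """True if a RUNNING_XACTS record appears at a lower LSN than the prune."""
    rx = first_lsn(lines, "RUNNING_XACTS")
    prune = first_lsn(lines, "PRUNE_ON_ACCESS")
    return rx is not None and prune is not None and rx < prune


# The two records quoted above (other records elided, as in the original).
SAMPLE = [
    "lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 "
    "latestCompletedXid 765 oldestRunningXid 766",
    "lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS "
    "snapshotConflictHorizon: 765,...",
]

print(running_xacts_precedes_prune(SAMPLE))
```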

> I'm not sure, as I think a xl_running_xacts could still be generated after
> we execute "our sql" meaning:
>
> "
> $node_primary->safe_psql('testdb', qq[$sql]);
> "
>
> and before we launch the new DML. In that case I guess the issue could still
> happen.
>
> OTOH If we create the new DML "before" we launch "our sql" then the test
> would also fail for both active and inactive slots because that would not
> invalidate any slots.
>
> I did observe the above with the attached changes (just changing the PREPARE
> TRANSACTION location).

I've also tried the idea of keeping a transaction open via background_psql(),
but I got the same result. The test could still fail when a RUNNING_XACTS record
was generated before the transaction started.

> I agree, but I'm not sure it's doable as it looks to me that we should prevent
> the catalog xmin from advancing past the conflict point while still
> generating a conflict point. Will try to give it another thought.

One primitive idea I had was to stop the walsender/pg_recvlogical process for a while.
Sending SIGSTOP to pg_recvlogical might do the trick, but ISTM it would not work on Windows.
See 019_replslot_limit.pl.
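
To make the idea concrete, below is a rough, POSIX-only sketch in Python (not the Perl used by the TAP tests) of pausing and resuming a process with SIGSTOP/SIGCONT. The helper names are hypothetical, and a sleeping child process stands in for pg_recvlogical. As noted, SIGSTOP is not available on Windows, so the sketch simply raises an error there.

```python
import os
import signal
import subprocess
import sys


def pause_process(pid):
    """Suspend a process with SIGSTOP (POSIX only)."""
    if not hasattr(signal, "SIGSTOP"):
        raise RuntimeError("SIGSTOP is not available on this platform")
    os.kill(pid, signal.SIGSTOP)


def resume_process(pid):
    """Let a previously stopped process continue."""
    os.kill(pid, signal.SIGCONT)


def demo():
    """Stop a stand-in child process, confirm it stopped, then resume it."""
    # A sleeping child stands in for pg_recvlogical here (an assumption).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        pause_process(child.pid)
        # WUNTRACED makes waitpid() report the stop, not only termination.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        resume_process(child.pid)
        return stopped
    finally:
        # Clean up the stand-in process.
        child.kill()
        child.wait()


if __name__ == "__main__":
    print(demo())
```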

Best regards,
Hayato Kuroda
FUJITSU LIMITED
