From: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
---|---|
To: | 'Bertrand Drouvot' <bertranddrouvot(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | RE: Fix 035_standby_logical_decoding.pl race conditions |
Date: | 2025-03-21 12:28:10 |
Message-ID: | OSCPR01MB14966852B0E4CF07D42774695F5DB2@OSCPR01MB14966.jpnprd01.prod.outlook.com |
Lists: | pgsql-hackers |
Dear Bertrand,
I'm also working on the thread to resolve the random failure.
> Yes, that's also my understanding. It's also easy to "simulate" by adding
> a checkpoint on the primary and a long enough sleep after we launched our sql in
> wait_until_vacuum_can_remove().
Thanks for letting me know. For me, it could be reproduced with only the sleep().
> > So, if the above is correct, the reason for generating extra
> > xl_running_xacts on primary is Vacuum followed by Insert on primary
> > via below part of test:
> > $node_primary->safe_psql(
> > 'testdb', qq[VACUUM $vac_option verbose $to_vac;
> > INSERT INTO flush_wal DEFAULT VALUES;]);
>
> I'm not sure, I think a xl_running_xacts could also be generated (for example by
> the checkpointer) before the vacuum (should the system be slow enough).
I think you are right. When I added a `CHECKPOINT` and a sleep after the user SQL,
I got the ordering below. This means that a RUNNING_XACTS record was generated before the
prune triggered by the vacuum.
```
...
lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766
...
lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765,...
...
```
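To check this ordering mechanically rather than by eye, something like the following could be run over pg_waldump text output. This is only a rough sketch: the function names and the regular expression are my own assumptions, and it relies only on the `lsn: X/Y` and `desc: RECORD_TYPE` fields visible in the excerpt above.

```python
import re

# Picks up the "lsn: X/Y" and "desc: RECORD_TYPE" parts of a pg_waldump line.
# Assumption: these fields look like they do in the excerpt above.
LINE_RE = re.compile(r"lsn: ([0-9A-Fa-f]+)/([0-9A-Fa-f]+),.*?desc: (\w+)")


def parse_lsn(hi, lo):
    """Convert the two hex halves of an LSN ("0", "04025218") to an integer."""
    return (int(hi, 16) << 32) | int(lo, 16)


def record_lsns(waldump_text):
    """Return a dict mapping record type to the list of LSNs where it appears."""
    found = {}
    for line in waldump_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            lsn = parse_lsn(m.group(1), m.group(2))
            found.setdefault(m.group(3), []).append(lsn)
    return found


def running_xacts_precedes_prune(waldump_text):
    """True if the first RUNNING_XACTS record comes before the first prune record."""
    recs = record_lsns(waldump_text)
    running = recs.get("RUNNING_XACTS", [])
    prune = recs.get("PRUNE_ON_ACCESS", [])
    if not running or not prune:
        return False
    return min(running) < min(prune)


# The two records quoted in this mail; the elided lines are simply skipped.
SAMPLE = """\
...
lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766
...
lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765,...
...
"""

if __name__ == "__main__":
    print(running_xacts_precedes_prune(SAMPLE))
```

For the two records above this reports that RUNNING_XACTS comes first, which is the problematic ordering.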
> I'm not sure, as I think a xl_running_xacts could still be generated after
> we execute "our sql" meaning:
>
> "
> $node_primary->safe_psql('testdb', qq[$sql]);
> "
>
> and before we launch the new DML. In that case I guess the issue could still
> happen.
>
> OTOH If we create the new DML "before" we launch "our sql" then the test
> would also fail for both active and inactive slots because that would not
> invalidate any slots.
>
> I did observe the above with the attached changes (just changing the PREPARE
> TRANSACTION location).
I've also tried the idea of keeping a transaction open via background_psql(),
but I got the same result. The test could still fail when a RUNNING_XACTS record was
generated before the transaction started.
> I agree, but I'm not sure it's doable as it looks to me that we should prevent
> the catalog xmin to advance to advance past the conflict point while still
> generating a conflict point. Will try to give it another thought.
One primitive idea I had was to stop the walsender/pg_recvlogical process for a while.
Sending SIGSTOP to pg_recvlogical might do the trick, but ISTM it would not work on Windows.
See 019_replslot_limit.pl.
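To illustrate the stop/continue idea and the platform limitation, here is a minimal sketch outside the TAP framework. It is not the code from 019_replslot_limit.pl; `pause_process` is a hypothetical helper, and the child process is only a stand-in for pg_recvlogical.

```python
import os
import signal
import subprocess
import sys
import time


def pause_process(pid, seconds):
    """Suspend pid with SIGSTOP for `seconds`, then resume it with SIGCONT.

    Returns False without doing anything where SIGSTOP does not exist
    (for example on Windows), True otherwise.
    """
    if not hasattr(signal, "SIGSTOP"):
        return False
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)
    return True


if __name__ == "__main__":
    # Stand-in child process (assumption: a real test would use the
    # pg_recvlogical or walsender pid instead).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(5)"]
    )
    try:
        print("paused:", pause_process(child.pid, 0.5))
    finally:
        child.kill()
        child.wait()
```

On a platform without SIGSTOP the helper just returns False, so a test relying on it would have to be skipped there.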
Best regards,
Hayato Kuroda
FUJITSU LIMITED