From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
Cc: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, exclusion(at)gmail(dot)com |
Subject: | Re: Fix race condition in InvalidatePossiblyObsoleteSlot() |
Date: | 2024-03-07 00:54:24 |
Message-ID: | ZekQQHCrIqLVpGz5@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Mar 06, 2024 at 05:45:56PM +0000, Bertrand Drouvot wrote:
> Thank you both for the report! I did a few test manually and can see the issue
> from times to times. When the issue occurs, the logical decoding was able to
> go through the place where LogicalConfirmReceivedLocation() updates the
> slot's catalog_xmin before being killed. In that case I can see that the
> catalog_xmin is updated to the xid horizon.
Two buildfarm machines have complained here, and one of them twice in
a row. That's quite amazing, because a couple of dozen runs done in a
row on the same host as these animals all pass. The CI did not
complain either (did 2~3 runs there yesterday).
> Means in a failed test we have something like:
>
> slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839."
>
> While when the test is ok you'll see something like:
>
> slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."
Perhaps we should also make the test report the catalog_xmin of the
slot. That may make debugging a bit easier.
> In the failing test the call to SELECT pg_logical_slot_get_changes() does
> not advance the slot's catalog xmin anymore.
Is that something that we could enforce in the test in a stronger way,
cross-checking the xmin horizon before and after the call?
> To fix this, I think we need a new transacion to decode from the primary before
> executing pg_logical_slot_get_changes(). But this transaction has to be replayed
> on the standby first by the startup process. Which means we need to wakeup
> "terminate-process-holding-slot" and that we probably need another injection
> point somewehere in this test.
>
> I'll look at it unless you've another idea?
I am wondering if there is something else lurking here, actually, so
for now I am going to revert the change as it is annoying to get
sporadic failures in the CF bot at this time of the year and there are
a lot of patches under discussion. Let's give it more time and more
thoughts, without pressure.
--
Michael
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Smith | 2024-03-07 02:04:51 | Re: Synchronizing slots from primary to standby |
Previous Message | Thomas Munro | 2024-03-07 00:37:36 | Re: Combine headerscheck and cpluspluscheck scripts |