From: | Peter Smith <smithpb2250(at)gmail(dot)com> |
---|---|
To: | vignesh C <vignesh21(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Improve error handling for invalid slots and ensure a same 'inactive_since' time for inactive slots |
Date: | 2025-01-31 00:19:32 |
Message-ID: | CAHut+Pv-bqmqKNmFJ9TfuN0UpoMzeXNaD_dLveLvcPgpF9DKBQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Some review comments for patch v4-0001.
======
src/backend/replication/logical/logical.c
CreateDecodingContext:
1.
Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE);
Assert(MyReplicationSlot->data.restart_lsn != InvalidXLogRecPtr);
~
These lines are adjacent to the patch change. It's inconsistent to
refer to 'MyReplicationSlot' here. Everywhere else in this function
deliberated using variable 'slot' to keep the code shorter. In passing
we should change this too.
======
src/backend/replication/slot.c
2.
+ *
+ * An error is raised if error_if_invalid is true and the slot is found to
+ * be invalid. It should always be set to true, except when we are temporarily
+ * acquiring the slot and doesn't intend to change it.
*/
typo /and doesn't intend to change it/and don't intend to change it/
On Fri, Jan 31, 2025 at 1:02 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Thu, 30 Jan 2025 at 16:02, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > I have made minor changes in the attached. The main change is to raise
> > an ERROR before we broadcast to let everybody know we've modified this
> > slot. There is no point in broadcasting if the slot is unusable.
> >
> > > - Updated the test files to match the new error message.
> > >
> >
> > - /ERROR: cannot alter invalid replication slot "vacuum_full_inactiveslot"/
> > + /ERROR: can no longer access replication slot "vacuum_full_inactiveslot"/
> >
> > - "can no longer get changes from replication slot
> > \"shared_row_removal_activeslot\""
> > + "can no longer access replication slot \"shared_row_removal_activeslot\""
> >
> > With the above changes, we made the ERROR message generic which was
> > required to raise it from the central place. This looks reasonable to
> > me. The other option that occurred to me is to write it as: "can no
> > longer use replication slot "vacuum_full_inactiveslot" to access WAL
> > or modify it" but I find the one used currently in the patch better as
> > this is longer and may need modification in future if we start using
> > slot for something else.
>
> One of the test was failing because MyReplicationSlot was not set to s
> before erroring out which resulted in:
> TRAP: failed Assert("s->data.persistency == RS_TEMPORARY"), File:
> "slot.c", Line: 777, PID: 280121
> postgres: primary: walsender vignesh [local]
> START_REPLICATION(ExceptionalCondition+0xbb)[0x578612cd9289]
> postgres: primary: walsender vignesh [local]
> START_REPLICATION(ReplicationSlotCleanup+0x12b)[0x578612a32348]
> postgres: primary: walsender vignesh [local]
> START_REPLICATION(WalSndErrorCleanup+0x5e)[0x578612a41995]
>
> Fixed it by setting MyReplicationSlot just before erroring out so that
> WalSndErrorCleanup function will have access to it. I also moved the
> error reporting above as there is no need to check for slot is active
> for pid in case of invalidated slots. The attached patch has the
> changes for the same.
>
3.
+ /* We made this slot active, so it's ours now. */
+ MyReplicationSlot = s;
+
+ /* Invalid slots can't be modified or used before accessing the WAL. */
+ if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("can no longer access replication slot \"%s\"",
+ NameStr(s->data.name)),
+ errdetail("This replication slot has been invalidated due to \"%s\".",
+ SlotInvalidationCauses[s->data.invalidated]));
+
This change looked dubious. I understand that it was done to avoid a
TRAP, but I am not sure if this is the correct fix. At the point of
that ereport ERROR this slot doesn't yet belong to us because we are
still determining *can* we acquire this slot or not, so I felt it
doesn't seem correct to have the MyReplicationSlot code/comment ("it's
ours now") come before the ERROR.
Furthermore, having the code/comment "We made this slot active, so
it's ours now" ahead of the other code/comment "... but it's already
active in another process..." is contradictory.
======
Kind Regards,
Peter Smith.
Fujitsu Australia
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2025-01-31 00:21:53 | Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData |
Previous Message | Masahiko Sawada | 2025-01-31 00:03:56 | Re: Skip collecting decoded changes of already-aborted transactions |