From: | Peter Smith <smithpb2250(at)gmail(dot)com> |
---|---|
To: | vignesh C <vignesh21(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Improve error handling for invalid slots and ensure a same 'inactive_since' time for inactive slots |
Date: | 2025-01-31 01:13:09 |
Message-ID: | CAHut+PvEEdUgVS3SkEYhDw8_HKcSqmGytVyF-+sZaQSce_8hpQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Jan 31, 2025 at 11:19 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
...
> On Fri, Jan 31, 2025 at 1:02 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Thu, 30 Jan 2025 at 16:02, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > I have made minor changes in the attached. The main change is to raise
> > > an ERROR before we broadcast to let everybody know we've modified this
> > > slot. There is no point in broadcasting if the slot is unusable.
> > >
> > > > - Updated the test files to match the new error message.
> > > >
> > >
> > > - /ERROR: cannot alter invalid replication slot "vacuum_full_inactiveslot"/
> > > + /ERROR: can no longer access replication slot "vacuum_full_inactiveslot"/
> > >
> > > - "can no longer get changes from replication slot
> > > \"shared_row_removal_activeslot\""
> > > + "can no longer access replication slot \"shared_row_removal_activeslot\""
> > >
> > > With the above changes, we made the ERROR message generic which was
> > > required to raise it from the central place. This looks reasonable to
> > > me. The other option that occurred to me is to write it as: "can no
> > > longer use replication slot "vacuum_full_inactiveslot" to access WAL
> > > or modify it" but I find the one used currently in the patch better as
> > > this is longer and may need modification in future if we start using
> > > slot for something else.
> >
> > One of the test was failing because MyReplicationSlot was not set to s
> > before erroring out which resulted in:
> > TRAP: failed Assert("s->data.persistency == RS_TEMPORARY"), File:
> > "slot.c", Line: 777, PID: 280121
> > postgres: primary: walsender vignesh [local]
> > START_REPLICATION(ExceptionalCondition+0xbb)[0x578612cd9289]
> > postgres: primary: walsender vignesh [local]
> > START_REPLICATION(ReplicationSlotCleanup+0x12b)[0x578612a32348]
> > postgres: primary: walsender vignesh [local]
> > START_REPLICATION(WalSndErrorCleanup+0x5e)[0x578612a41995]
> >
> > Fixed it by setting MyReplicationSlot just before erroring out so that
> > WalSndErrorCleanup function will have access to it. I also moved the
> > error reporting above as there is no need to check for slot is active
> > for pid in case of invalidated slots. The attached patch has the
> > changes for the same.
> >
>
> 3.
> + /* We made this slot active, so it's ours now. */
> + MyReplicationSlot = s;
> +
> + /* Invalid slots can't be modified or used before accessing the WAL. */
> + if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
> + ereport(ERROR,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("can no longer access replication slot \"%s\"",
> + NameStr(s->data.name)),
> + errdetail("This replication slot has been invalidated due to \"%s\".",
> + SlotInvalidationCauses[s->data.invalidated]));
> +
>
> This change looked dubious. I understand that it was done to avoid a
> TRAP, but I am not sure if this is the correct fix. At the point of
> that ereport ERROR this slot doesn't yet belong to us because we are
> still determining *can* we acquire this slot or not, so I felt it
> doesn't seem correct to have the MyReplicationSlot code/comment ("it's
> ours now") come before the ERROR.
>
> Furthermore, having the code/comment "We made this slot active, so
> it's ours now" ahead of the other code/comment "... but it's already
> active in another process..." is contradictory.
>
Hi Nisha.
Further to my previous post, I was experimenting with an alternate fix
for the TAP test error reported by Vignesh.
Please see my diffs -- apply this atop the v4-0001 patch.
Basically, I have just moved the invalidation ereport to be earlier in
the code. IIUC, the TRAP might have been due to the v4-0001 still
allowing the slot active_pid being modified to MyProcPid prematurely
even when we did not end up acquiring the invalidated slot.
Note, I also added a LWLockRelease to match the existing/adjacent ereport.
Anyway, please consider it. The recovery and subscription TAP test are
working for me.
======
Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment | Content-Type | Size |
---|---|---|
PS_20250131_v4_updates.txt | text/plain | 1.9 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2025-01-31 01:19:05 | postgresql.conf.sample ordering for IO, worker related GUCs |
Previous Message | Michael Paquier | 2025-01-31 00:21:53 | Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData |