Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: depesz(at)depesz(dot)com, pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2022-11-17 01:17:59
Message-ID: CAD21AoAuZm-UR7urLujjd5J+fPpH8wmGDWe+rEZJz2DjEg28xw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Nov 16, 2022 at 4:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > > > > Found one in the time frame you mentioned:
> > > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> > > > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> > > > > slot's restart_lsn to this LSN, we remove WAL files older than
> > > > > 000000000001039D0000008A.
> > > > >
> > > > > In LogicalIncreaseRestartDecodingForSlot(), since
> > > > > "current_lsn(1039D/9170B010) <
> > > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> > > > > part and called LogicalConfirmReceivedLocation():
> > > > >
> > > > > else if (current_lsn <= slot->data.confirmed_flush)
> > > > > {
> > > > > slot->candidate_restart_valid = current_lsn;
> > > > > slot->candidate_restart_lsn = restart_lsn;
> > > > >
> > > > > /* our candidate can directly be used */
> > > > > updated_lsn = true;
> > > > > }
> > > > >
> > > >
> > > > If this would have been executed in
> > > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
> > > > the above DEBUG messages "current candidate 1039D/83825958, current
> > > > after 1039D/8B5773D8" should be the same as proposed and after
> > > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
> > > > something?
> > >
> > > Oh, you're right.
> > >
> > > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
> > > to or greater than 1039D/8B5773D8 at that time but
> > > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
> > >
> >
> > Yes, that is weird but it had been a bit obvious if the same LOG would
> > have printed slot->data.restart_lsn. This means that somehow slot's
> > 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
> > figure out yet how that can happen but if that happens then the slot's
> > restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
> > don't check if slot's candidate_restart_lsn is lesser than its
> > restart_lsn before assigning the same in line
> > MyReplicationSlot->data.restart_lsn =
> > MyReplicationSlot->candidate_restart_lsn;. I think that should be
> > checked. Note that we call LogicalConfirmReceivedLocation() can be
> > called from ProcessStandbyReplyMessage(), so once the wrong
> > candidate_restart_lsn is set, it can be assigned to restart_lsn from
> > other code paths as well.
> >
> > I am not able to think how 'candidate_restart_lsn' can be set to an
> > LSN value prior to 'restart_lsn'.
> >
>
> In the below part of the code, we use the value of
> 'last_serialized_snapshot' for restart_lsn.
> else if (txn == NULL &&
> builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
> builder->last_serialized_snapshot != InvalidXLogRecPtr)
> LogicalIncreaseRestartDecodingForSlot(lsn,
> builder->last_serialized_snapshot);
>
> Now, say, after restart, we start reading from slot's restart_lsn
> which is 1039D/8B5773D8. At this LSN, we restored a snapshot that has
> the last_seriealized_snapshot set to 1039D/83825958.

I don't get this part. You meant that the snapshot we restored at
1039D/8B5773D8 has last_serialized_snapshot set to 1039D/83825958? I
think we don't update last_serialized_snapshot when restoring the
snapshot. Also, if we restore the snapshot after restart, we don't
execute the above code since SnapBuildFindSnapshot() returns false,
no?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2022-11-17 04:06:47 Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Previous Message David G. Johnston 2022-11-16 21:02:40 Re: jsonb_path_query unexpected results when filter key / path does not exist..