Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: depesz(at)depesz(dot)com, pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2022-11-16 04:22:06
Message-ID: CAA4eK1+4sWqA2u5iiXaqVE6bbQxjWot90Uqf-KE4PtMv3NkuuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
> <depesz(at)depesz(dot)com> wrote:
> >
> > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > > <depesz(at)depesz(dot)com> wrote:
> > > >
> > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18
> > > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
> > > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38
> > > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > >
> > > It happened twice that the walsender process was active for at least
> > > one minute while not advancing restart_lsn and exited. Could you share
> > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> > > can find out what the wal sender processes were doing?
> >
> > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
> > lot of logs to check if I don't leak any sensitive info.
> >
> > But, if I'll remove lines that are generated by walsender to pg12
> > replicas (streaming replication to bionic/pg12 replicas), the logfile is
> > only ~ 3MB. Still a bit too much, as there can be some queries with
> > sensitive info.
> >
> > Perhaps I can grep something out?
> >
> > > Also, you may have logs for "failed to increase restart lsn: proposed
> > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > > please share it as well.
> >
> > Found one in the time frame you mentioned:
> > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> >
>
> Thanks!
>
> LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> slot's restart_lsn to this LSN, we remove WAL files older than
> 000000000001039D0000008A.
>
> In LogicalIncreaseRestartDecodingForSlot(), since
> "current_lsn(1039D/9170B010) <
> slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> part and called LogicalConfirmReceivedLocation():
>
> else if (current_lsn <= slot->data.confirmed_flush)
> {
> slot->candidate_restart_valid = current_lsn;
> slot->candidate_restart_lsn = restart_lsn;
>
> /* our candidate can directly be used */
> updated_lsn = true;
> }
>

If this would have been executed in
LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
the above DEBUG messages "current candidate 1039D/83825958, current
after 1039D/8B5773D8" should be the same as proposed and after
"proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
something?

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Masahiko Sawada 2022-11-16 06:13:28 Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Previous Message PG Bug reporting form 2022-11-15 20:00:38 BUG #17688: The Database Cluster Initialization Failed