Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: depesz(at)depesz(dot)com
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2022-11-15 14:59:37
Message-ID: CAD21AoAC7NvyFvTfDxy-bK6Dvnm8D=-TG0c7B6==W5vhA0dtxA@mail.gmail.com
Lists: pgsql-bugs

On Tue, Nov 15, 2022 at 6:01 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
> <depesz(at)depesz(dot)com> wrote:
> >
> > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > > <depesz(at)depesz(dot)com> wrote:
> > > >
> > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18
> > > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
> > > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38
> > > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > >
> > > It happened twice that the walsender process was active for at least
> > > one minute without advancing restart_lsn, and then exited. Could you
> > > share the server logs from around 2022-11-10 20:59:55 to 21:08:47 so
> > > that we can find out what those walsender processes were doing?
> >
> > That will be ~270MB. Not sure how that would fly, plus that's an awful
> > lot of logs to check to make sure I don't leak any sensitive info.
> >
> > But if I remove the lines generated by walsenders for the pg12 replicas
> > (streaming replication to bionic/pg12 replicas), the logfile is only
> > ~3MB. Still a bit too much, as there can be some queries with sensitive
> > info.
> >
> > Perhaps I can grep something out?
> >
> > > Also, you may have logs for "failed to increase restart lsn: proposed
> > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > > please share them as well.
> >
> > Found one in the time frame you mentioned:
> > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> >
>
> Thanks!
>
> LSN 1039D/8B5773D8 seems to be related to this issue. If we advance the
> slot's restart_lsn to this LSN, we remove WAL files older than
> 000000000001039D0000008A.
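
For reference, here is a minimal standalone sketch (not PostgreSQL source;
it assumes the default 16MB wal_segment_size and timeline 1, both of which
are assumptions) of how an LSN maps to the WAL segment file that contains
it. Segments numbered below that one become candidates for removal, ignoring
other retention factors, once every slot's restart_lsn has moved past them:

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        /* Assumed defaults; the real server's settings may differ. */
        const uint64_t wal_segsz = 16 * 1024 * 1024;
        const uint64_t segs_per_xlogid = UINT64_C(0x100000000) / wal_segsz;

        /* LSN 1039D/8B5773D8 from the DEBUG message above */
        uint64_t lsn = (UINT64_C(0x1039D) << 32) | UINT64_C(0x8B5773D8);
        uint64_t segno = lsn / wal_segsz;

        /* WAL file names are timeline, "xlogid", "segment": 8 hex digits each */
        printf("%08X%08X%08X\n",
               (unsigned) 1,                          /* timeline (assumed) */
               (unsigned) (segno / segs_per_xlogid),
               (unsigned) (segno % segs_per_xlogid));
        return 0;
    }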
>
> In LogicalIncreaseRestartDecodingForSlot(), since
> "current_lsn (1039D/9170B010) <
> slot->data.confirmed_flush (1039D/91F41B50)", we took the following
> branch and then called LogicalConfirmReceivedLocation():
>
> else if (current_lsn <= slot->data.confirmed_flush)
> {
>     slot->candidate_restart_valid = current_lsn;
>     slot->candidate_restart_lsn = restart_lsn;
>
>     /* our candidate can directly be used */
>     updated_lsn = true;
> }
>
> Then in LogicalConfirmReceivedLocation(), IIUC we executed the following code:
>
> if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
>     MyReplicationSlot->candidate_restart_valid <= lsn)
> {
>     Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);
>
>     MyReplicationSlot->data.restart_lsn =
>         MyReplicationSlot->candidate_restart_lsn;
>     MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
>     MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
>     updated_restart = true;
> }
>
> So you should have logs for "updated xmin: %u restart: %u"; could you check for them?
>
> After the above code, we do:
>
> SpinLockRelease(&MyReplicationSlot->mutex);
>
> /* first write new xmin to disk, so we know what's up after a crash */
> if (updated_xmin || updated_restart)
> {
>     ReplicationSlotMarkDirty();
>     ReplicationSlotSave();
>     elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin,
>          updated_restart);
> }
>
> Is something like the following scenario possible?
>
> 1. The walsender updates the slot's restart_lsn and releases the spinlock
> (the slot is not yet saved to disk).
> 2. Someone recomputes the minimum restart_lsn across slots (note that the
> slot's restart_lsn in memory is already updated).
> 3. The checkpointer removes WAL files older than the minimum restart_lsn
> calculated at step 2.
> 4. The walsender restarts for some reason (or the server crashes).
> 5. After the restart, the walsender acquires the slot, but the slot's
> restart_lsn on disk is old.
> 6. The WAL files required to restart decoding have already been removed.
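
To make the ordering concrete, here is a toy model of the scenario above
(illustrative C only, not PostgreSQL code; the struct, variable names and
LSN values are all made up):

    #include <stdio.h>
    #include <stdint.h>

    /* Toy stand-in for a replication slot's two copies of restart_lsn. */
    typedef struct ToySlot
    {
        uint64_t restart_lsn_mem;   /* copy in shared memory */
        uint64_t restart_lsn_disk;  /* persisted copy (state file on disk) */
    } ToySlot;

    int
    main(void)
    {
        ToySlot  slot = { .restart_lsn_mem = 100, .restart_lsn_disk = 100 };
        uint64_t oldest_kept_wal = 100;

        /* 1. the walsender advances restart_lsn in memory and releases the
         *    spinlock; nothing has been written to disk yet */
        slot.restart_lsn_mem = 200;

        /* 2-3. the minimum restart_lsn is recomputed from the in-memory
         *      value and the checkpointer removes WAL older than it */
        oldest_kept_wal = slot.restart_lsn_mem;

        /* 4. crash before the slot is marked dirty and saved:
         *    restart_lsn_disk still says 100 */

        /* 5-6. after restart the slot is restored from disk, so decoding
         *      needs WAL that the checkpointer has already removed */
        if (slot.restart_lsn_disk < oldest_kept_wal)
            printf("slot needs WAL from %llu but WAL before %llu is gone\n",
                   (unsigned long long) slot.restart_lsn_disk,
                   (unsigned long long) oldest_kept_wal);
        return 0;
    }

The point of the model is only that the WAL-removal decision at step 3 sees
the in-memory restart_lsn, while recovery at step 5 sees the older on-disk
copy.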

I'm not sure that the reported issue happened via the above scenario,
but I've confirmed that this scenario can happen in my environment by
adding sleeps and killing the walsender process. After the server
restarted, pg_replication_slots.restart_lsn had retreated from 0/5390F78
to 0/1659FC0, and the walsender raised the error "requested WAL segment
000000010000000000000001 has already been removed".

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
