Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: amit(dot)kapila16(at)gmail(dot)com
Cc: depesz(at)depesz(dot)com, pgsql-bugs(at)postgresql(dot)org
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2022-10-17 06:53:07
Message-ID: 20221017.155307.3782702886583901.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in
> On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
> <depesz(at)depesz(dot)com> wrote:
> > At 2:20:48pm there was:
> > 2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
> >
> > But these errors happened earlier too (minutes/hours earlier).
> >
> > But then, 3 minutes later I got:
> >
> > 2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.",,,,,,,,,"","logical replication worker",,0
> > 2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
> > 2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
> > 2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
> > 2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
> >
> >
> > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
> >
>
> This is quite strange and I am not able to see the reason why this can
> happen. The only way to debug this problem that comes to mind is to
> add more LOGS around the code that removes the WAL files. For example,
> we can try to print the value of minimumslotLSN (keep) and logSegNo in
> KeepLogSeg().

What I can see from the information at hand is that walsender is
killed (not died) twice for some reasons then some "reserved" WAL
files are lost. I don't see the reason for the second walsender (pid
29207) being killed but if the "has already been removed" log line
above is the first one, the file removal is probably happend AT
14:23:10. (log is needed to know that.)

Anyway, I didn't come up with a possibility other than of someone else
having removed those files.. For example, virus scan softwares. If
any such software kills processes that is grabbing the files that the
software identified as malicious and is going to remove, that could be
happen?

> > So, 4 files are missing.
> >
> > These were archived properly, and I tried to restore them from archive, and put
> > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
> > slot with the same message about "requested WAL segment
> > 000000010000CA0A00000049 has already been removed"
> >
>
> I think you are seeing this behavior because we update the
> lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
> then we use that to give the error you are seeing.

lastRemovedSegNo is updated once per one segment of removal. Four
files are lost in this case.

> > I know that at this moment we're hosed. This replication was removed,
> > everythbing cleared (after all, we accumulated already 300G of wal, and
> > we can't extend the FS forever, so we needed to cleanup the situation).
> >
> > But, the way I look at it, Pg12 shouldn't be removing these 4 wal files,
> > and the fact that it did means that either we did something wrong (what,
> > though?), or there is some kind of bug in Pg.
> >
> > Anyone has any idea on how to investigate/fix that?
> >
>
> As suggested above, I feel we need to add some LOGS to see what is
> going on unless you have a reproducible testcase.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Julien Rouhaud 2022-10-17 08:15:16 Re: BUG #17645: Strange queries that stuck in database system
Previous Message Julien Rouhaud 2022-10-17 03:49:28 Re: BUG #17636: terminating connection because of crash of another server process