Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: depesz(at)depesz(dot)com
Cc: pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2022-11-14 13:00:57
Message-ID: CAA4eK1+h8KAGW1bvo3vECWR9-ZExp1EptOBvqWP8V-aDyvaORg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Nov 14, 2022 at 5:47 PM hubert depesz lubaczewski
<depesz(at)depesz(dot)com> wrote:
>
> On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
> > > #v+
> > > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
> ...
> > > #v-
> > >
> > > So, the whole set of missing files was recycled at the same time.
> > >
> > > One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1 minute. Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp. Looks like this:
> > >
> > > #v+
> > > timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn
> > ...
> > ...
> > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
> > ...
> >
> > I think from the above two it is clear that the slot with restart_lsn
> > 1039D/83825958 is active and we seem to have recycled corresponding
> > segment.
> >
> > You may have the LOG for "attempting to remove WAL segments older than
> > log file %s", if so, please share. Also on similar lines, I think we
>
> Absolutely.
>
> There is something weird happening:
>

What exactly weird you are seeing in this? To me, it appears as if the
system due to some reason ignores an existing slot that has
restart_lsn as 1039D/83825958.

>
> If I'll add "recycled" lines we will get:
>
> #v+
> ...
> 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
> 2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
>

Actually, the last LOG was what I was not expecting because of the
existence of a corresponding slot. This is what we have to find out
how it has computed the value "000000000001039D0000008A" for the file
to be removed.

> > need to add some LOGs before calling RemoveOldXlogFiles() to know
> > where our computation to remove files goes off. Something, like the
> > attached, would be helpful but note I have added this quickly on my
> > working branch to show you what I have in mind. You may need small
> > changes based on the branch you are working on.
>
> That will be complicated.
>
> Changing pg (recompile, and rerun) requires careful preparation with our
> customers. Which takes time.
>
> Additionally we are using precompiled binaries from pgdg, that makes the
> process slightly more involved.
>

I can understand but OTOH it is not easy to proceed without more information.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2022-11-14 13:11:59 Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Previous Message hubert depesz lubaczewski 2022-11-14 12:17:03 Re: WAL segments removed from primary despite the fact that logical replication slot needs it.