From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Subject: | Re: Assertion failure in SnapBuildInitialSnapshot() |
Date: | 2023-01-30 04:57:14 |
Message-ID: | CAA4eK1LU6jsjtqvh0hbuqXPUn8N+rf4c5-Urh3S17p+MTQvJew@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sun, Jan 29, 2023 at 9:15 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Sat, Jan 28, 2023 at 11:54 PM Hayato Kuroda (Fujitsu)
> <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> >
> > Dear Amit, Sawada-san,
> >
> > I have also reproduced the failure on PG15 with some debug log, and I agreed that
> > somebody changed procArray->replication_slot_xmin to InvalidTransactionId.
> >
> > > > The same assertion failure has been reported on another thread[1].
> > > > Since I could reproduce this issue several times in my environment
> > > > I've investigated the root cause.
> > > >
> > > > I think there is a race condition of updating
> > > > procArray->replication_slot_xmin by CreateInitDecodingContext() and
> > > > LogicalConfirmReceivedLocation().
> > > >
> > > > What I observed in the test was that a walsender process called:
> > > > SnapBuildProcessRunningXacts()
> > > > LogicalIncreaseXminForSlot()
> > > > LogicalConfirmReceivedLocation()
> > > > ReplicationSlotsComputeRequiredXmin(false).
> > > >
> > > > In ReplicationSlotsComputeRequiredXmin() it acquired the
> > > > ReplicationSlotControlLock and got 0 as the minimum xmin since there
> > > > was no wal sender having effective_xmin.
> > > >
> > >
> > > What about the current walsender process which is processing
> > > running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender
> > > slot's effective_xmin have a non-zero value? If not, then why?
> >
> > Normal walsenders which are not for tablesync create a replication slot with
> > NOEXPORT_SNAPSHOT option. I think in this case, CreateInitDecodingContext() is
> > called with need_full_snapshot = false, and slot->effective_xmin is not updated.
>
> Right. This is how we create a slot used by an apply worker.
>
I was thinking about how that led to this problem because
GetOldestSafeDecodingTransactionId() ignores InvalidTransactionId. It
seems that is possible when both builder->xmin and
replication_slot_catalog_xmin precede replication_slot_catalog_xmin.
Do you see any different reason for it?
--
With Regards,
Amit Kapila.
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2023-01-30 05:06:10 | Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl |
Previous Message | John Naylor | 2023-01-30 04:31:41 | Re: [PoC] Improve dead tuple storage for lazy vacuum |