Re: [HACKERS] make async slave to wait for lsn to be replayed

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Kartyshov Ivan <i(dot)kartyshov(at)postgrespro(dot)ru>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [HACKERS] make async slave to wait for lsn to be replayed
Date: 2024-06-19 20:08:50
Message-ID: CAPpHfdvDjrM6=XxDiuZBa+bg2qL34jwafa3uXf0Zsxm2PKOChg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 14, 2024 at 3:46 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan
> <i(dot)kartyshov(at)postgrespro(dot)ru> wrote:
> >
> > Hi, Alexander, Here, I made some improvements according to your
> > discussion with Heikki.
> >
> > On 2024-04-11 18:09, Alexander Korotkov wrote:
> > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
> > > wrote:
> > >> In a nutshell, it's possible for the loop in WaitForLSN to exit
> > >> without
> > >> cleaning up the process from the heap. I was able to hit that by
> > >> adding
> > >> a delay after the addLSNWaiter() call:
> > >>
> > >> > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
> > >> > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
> > >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
> > >> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc]
> > >> > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
> > >> > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
> > >> > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]
> > >>
> > >> I think there's a similar race condition if the timeout is reached at
> > >> the same time that the startup process wakes up the process.
> > >
> > > Thank you for catching this. I think WaitForLSN() just needs to call
> > > deleteLSNWaiter() unconditionally after exit from the loop.
> >
> > Fix and add injection point test on this race condition.
> >
> > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
> > > wrote:
> > >> The docs could use some-copy-editing, but just to point out one issue:
> > >>
> > >> > There are also procedures to control the progress of recovery.
> > >>
> > >> That's copy-pasted from an earlier sentence at the table that lists
> > >> functions like pg_promote(), pg_wal_replay_pause(), and
> > >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control
> > >> the
> > >> progress of recovery like those functions do, it only causes the
> > >> calling
> > >> backend to wait.
> >
> > Fix documentation and add extra tests on multi-standby replication
> > and cascade replication.
>
> Thank you for the revised patch.
>
> I see couple of items which are not addressed in this revision.
> * As Heikki pointed, that it's currently not possible in one round
> trip to call call pg_wal_replay_wait() and do other job. The attached
> patch addresses this. It milds the requirement. Now, it's not
> necessary to be in atomic context. It's only necessary to have no
> active snapshot. This new requirement works for me so far. I
> appreciate a feedback on this.
> * As Alvaro pointed, multiple waiters case isn't covered by the test
> suite. That leads to no coverage of some code paths. The attached
> patch addresses this by adding a test case with multiple waiters.
>
> The rest looks good to me.

Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl.

1. It's not clear why this test needs node_standby2 at all. It seems useless.
2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This
location seems to be unachievable in this test. So, it's not clear
what race condition this test could potentially detect.
3. I think it would make sense to check for the race condition
reported by Heikki. That is to insert the injection point at the
beginning of WaitLSNSetLatches().

Links.
1. https://www.postgresql.org/message-id/flat/CAPpHfdvGRssjqwX1%2Bidm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw%40mail.gmail.com#557900e860457a9e24256c93a2ad4920

------
Regards,
Alexander Korotkov
Supabase

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2024-06-19 20:35:29 Re: Schema variables - new implementation for Postgres 15
Previous Message Alexander Lakhin 2024-06-19 20:00:00 Re: Failures in constraints regression test, "read only 0 of 8192 bytes"