Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
Date: 2024-05-30 04:48:11
Message-ID: CAHut+Pv0KcOpOEDL8cY7Si4AVfsCFhD--4V7HvMpitXcv6PEiA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, May 30, 2024 at 2:09 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Wed, 29 May 2024 at 16:30, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >
> > Hello hackers,
> >
> > As a recent buildfarm test failure [1] shows:
> > [14:33:02.374](0.333s) ok 23 - update works with dropped subscriber column
> > ### Stopping node "publisher" using mode fast
> > # Running: pg_ctl -D
> > /home/bf/bf-build/adder/HEAD/pgsql.build/testrun/subscription/001_rep_changes/data/t_001_rep_changes_publisher_data/pgdata
> > -m fast stop
> > waiting for server to shut down.. ... ... ... .. failed
> > pg_ctl: server does not shut down
> > # pg_ctl stop failed: 256
> > # Postmaster PID for node "publisher" is 2222549
> > [14:39:04.375](362.001s) Bail out! pg_ctl stop failed
> >
> > 001_rep_changes_publisher.log
> > 2024-05-16 14:33:02.907 UTC [2238704][client backend][4/22:0] LOG: statement: DELETE FROM tab_rep
> > 2024-05-16 14:33:02.925 UTC [2238704][client backend][:0] LOG: disconnection: session time: 0:00:00.078 user=bf
> > database=postgres host=[local]
> > 2024-05-16 14:33:02.939 UTC [2222549][postmaster][:0] LOG: received fast shutdown request
> > 2024-05-16 14:33:03.000 UTC [2222549][postmaster][:0] LOG: aborting any active transactions
> > 2024-05-16 14:33:03.049 UTC [2222549][postmaster][:0] LOG: background worker "logical replication launcher" (PID
> > 2223110) exited with exit code 1
> > 2024-05-16 14:33:03.062 UTC [2222901][checkpointer][:0] LOG: shutting down
> > 2024-05-16 14:39:04.377 UTC [2222549][postmaster][:0] LOG: received immediate shutdown request
> > 2024-05-16 14:39:04.382 UTC [2222549][postmaster][:0] LOG: database system is shut down
> >
> > the publisher node may hang on stopping.
> >
> > I reproduced the failure (with aggressive autovacuum) locally and
> > discovered that it happens because:
> > 1) checkpointer calls WalSndInitStopping() (which sends
> > PROCSIG_WALSND_INIT_STOPPING to walsender), and then spins inside
> > WalSndWaitStopping() indefinitely, because:
> > 2) walsender receives the signal, sets got_STOPPING = true, but can't exit
> > WalSndLoop():
> > 3) it never sets got_SIGUSR2 (to get into WalSndDone()) in
> > XLogSendLogical():
> > 4) it never sets WalSndCaughtUp to true:
> > 5) logical_decoding_ctx->reader->EndRecPtr can't reach flushPtr in
> > XLogSendLogical():
> > 6) EndRecPtr doesn't advance in XLogNextRecord():
> > 7) XLogDecodeNextRecord() fails do decode a record that crosses a page
> > boundary:
> > 8) ReadPageInternal() (commented "Wait for the next page to become
> > available") constantly returns XLREAD_FAIL:
> > 9) state->routine.page_read()/logical_read_xlog_page() constantly returns
> > -1:
> > 10) flushptr = WalSndWaitForWal() stops advancing, because
> > got_STOPPING == true (see 2).
> >
> > That is, walsender doesn't let itself to catch up, if it gets the stop
> > signal when it's lagging behind and decoding a record requires reading
> > the next wal page.
> >
> > Please look at the reproducing test (based on 001_rep_changes.pl) attached.
> > If fails for me as below:
> > # 17
> > Bailout called. Further testing stopped: pg_ctl stop failed
> > FAILED--Further testing stopped: pg_ctl stop failed
> > make: *** [Makefile:21: check] Ошибка 255
>
> Thank you, Alexander, for sharing the script. I was able to reproduce
> the issue using the provided script. Furthermore, while investigating
> its origins, I discovered that this problem persists across all
> branches up to PG10 (the script needs slight adjustments to run it on
> older versions). It's worth noting that this issue isn't a result of
> recent version changes.
>

Hi,

FWIW using the provided scripting I was also able to reproduce the
problem on HEAD but for me, it was more rare. -- the script passed ok
3 times all 100 iterations; it eventually failed on the 4th run on the
75th iteration.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2024-05-30 06:43:41 Re: Improving the latch handling between logical replication launcher and worker processes.
Previous Message Andrei Lepikhov 2024-05-30 04:22:46 Re: POC: GROUP BY optimization