logical replication walsender loop preventing a clean shutdown

From: Greg Sabino Mullane <htamfids(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: logical replication walsender loop preventing a clean shutdown
Date: 2024-09-16 18:27:42
Message-ID: CAKAnmm+STYvW_5aRx2C0QWgbNpd_zEjruc6MytePnRuK8oKtTA@mail.gmail.com
Lists: pgsql-bugs

When doing logical replication, a large transaction can prevent the
postgres process from shutting down until the WAL has all been processed
and the client reports back. This is obviously less than ideal, as it means
a pg_ctl stop -m fast can take minutes or hours to complete. I would expect
the behavior to be that all backends are signalled so they can leave
cleanly.

I found this thread that reports something very similar (but without the
infinite looping):

Subject: walsender bug: stuck during shutdown
https://www.postgresql.org/message-id/flat/20201123205253.GA10075%40alvherre.pgsql

I have cc'd Alvaro in case he has any progress on this, or ideas. I tried
applying the patch from that thread, but the behavior remained unchanged.
Wanted to raise this in -bugs for added visibility, and also see if anyone
had thoughts before I dig deeper.

My test case (tested with latest, as of commit
b8ea0f675f35c3f0c2cf62175517ba0dacad4abd):

* Spin up a cluster, port 5555, using wal_level logical
* pg_recvlogical --create-slot -d postgres -p 5555 --slot=foo
* pg_recvlogical --start -d postgres -p 5555 --slot=foo --file /tmp/tmp
* If all is well, ctrl-z, bg 1, watch -n 3 tail /tmp/tmp

Other session:
* psql -p 5555 postgres
* create table t (id int generated always as identity, foo text);
* insert into t(foo) select 'abcdefghijklmnopqrstuvwxyz' from
generate_series(1,10_000_000);

Once the commit finishes, and as soon as pg_recvlogical starts processing
it:

* time pg_ctl stop -m fast -w -t 10000

I found 10 million a nice test on my system - shutdown takes an additional
50 seconds or so, as it waits for pg_recvlogical to respond.
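
The steps above can be sketched as a single script. This is only a rough
illustration, not the exact sequence that was run: the scratch data
directory, the initdb / pg_ctl start invocation, the two-second sleep, and
the dry-run wrapper (the REPRO_RUN variable and the run, run_bg and
reproduce functions) are all assumptions added here. By default it only
prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Dry-run by default: each command is
# printed rather than executed. Set REPRO_RUN=1 to really run them.

PORT=5555
SLOT=foo
OUTFILE=/tmp/tmp
ROWS=10000000
PGDATA_DIR=/tmp/walsender_repro   # hypothetical scratch directory (assumption)

# Run a command in the foreground, or just print it in dry-run mode.
run() {
    if [ "${REPRO_RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Same, but in the background (used for the long-lived pg_recvlogical).
run_bg() {
    if [ "${REPRO_RUN:-0}" = "1" ]; then
        "$@" &
    else
        printf '%s &\n' "$*"
    fi
}

reproduce() {
    # Cluster setup is an assumption; the report only says "spin up a cluster".
    run initdb -D "$PGDATA_DIR"
    run pg_ctl -D "$PGDATA_DIR" -o "-p $PORT -c wal_level=logical" -w start

    run pg_recvlogical --create-slot -d postgres -p "$PORT" --slot="$SLOT"
    run_bg pg_recvlogical --start -d postgres -p "$PORT" --slot="$SLOT" --file "$OUTFILE"

    run psql -p "$PORT" -d postgres -c \
        "create table t (id int generated always as identity, foo text)"
    run psql -p "$PORT" -d postgres -c \
        "insert into t(foo) select 'abcdefghijklmnopqrstuvwxyz' from generate_series(1,$ROWS)"

    # Crude stand-in for "as soon as pg_recvlogical starts processing it".
    run sleep 2

    start=$(date +%s)
    run pg_ctl -D "$PGDATA_DIR" stop -m fast -w -t 10000
    end=$(date +%s)
    if [ "${REPRO_RUN:-0}" = "1" ]; then
        echo "shutdown took $((end - start)) seconds"
    fi
}

reproduce
```

In dry-run mode the printed lines lose their shell quoting, so they are a
preview rather than something to paste back into a terminal. The fixed sleep
is the weakest part: the delay between the commit and the stop may need
adjusting by hand to catch pg_recvlogical mid-stream.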

Cheers,
Greg
