From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Network failure may prevent promotion |
Date: | 2024-01-18 08:26:31 |
Message-ID: | 20240118.172631.1740094280436463079.horikyota.ntt@gmail.com |
Lists: | pgsql-hackers |
At Sun, 31 Dec 2023 20:07:41 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> We've noticed that when walreceiver is waiting for a connection to
> complete, standby does not immediately respond to promotion
> requests. In PG14, upon receiving a promotion request, walreceiver
> terminates instantly, but in PG16, it waits for connection
> timeout. This behavior is attributed to commit 728f86fec65, where a
> part of libpqrcv_connect was simply replaced with a call to
> libpqsrv_connect_params. This behavior can be verified by simply
> dropping packets from the standby to the primary.
Apologies for the inconvenience on my part, but I need to fix this
behavior. To continue this discussion, I'm providing a repro script
here.
With the script, the standby is expected to promote immediately,
emitting the following log lines:
standby.log:
> 2024-01-18 16:25:22.245 JST [31849] LOG: received promote request
> 2024-01-18 16:25:22.245 JST [31850] FATAL: terminating walreceiver process due to administrator command
> 2024-01-18 16:25:22.246 JST [31849] LOG: redo is not required
> 2024-01-18 16:25:22.246 JST [31849] LOG: selected new timeline ID: 2
> 2024-01-18 16:25:22.274 JST [31849] LOG: archive recovery complete
> 2024-01-18 16:25:22.275 JST [31847] LOG: checkpoint starting: force
> 2024-01-18 16:25:22.277 JST [31846] LOG: database system is ready to accept connections
> 2024-01-18 16:25:22.280 JST [31847] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.005 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/1548E98, redo lsn=0/1548E40
> 2024-01-18 16:25:22.356 JST [31846] LOG: received immediate shutdown request
> 2024-01-18 16:25:22.361 JST [31846] LOG: database system is shut down
After 728f86fec65 was introduced, promotion does not complete with the
same operation, as shown below. The patch attached to the previous mail
restores the old behavior shown above.
> 2024-01-18 16:47:53.314 JST [34515] LOG: received promote request
> 2024-01-18 16:48:03.947 JST [34512] LOG: received immediate shutdown request
> 2024-01-18 16:48:03.952 JST [34512] LOG: database system is shut down
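For anyone who wants to tell the two outcomes apart mechanically, here
is a minimal sketch in Python. It is not part of promote_test.pl; the
helper name promotion_completed is hypothetical, and it only assumes
the log message texts shown in the two excerpts above.

```python
# Hypothetical helper, not taken from promote_test.pl.
# It relies only on the message texts visible in the log excerpts.
PROMOTE_REQUEST = "received promote request"
PROMOTE_DONE = "database system is ready to accept connections"


def promotion_completed(log_text: str) -> bool:
    """Return True if a promote request is later followed by the
    'ready to accept connections' message in the same log."""
    requested = False
    for line in log_text.splitlines():
        if PROMOTE_REQUEST in line:
            requested = True
        elif requested and PROMOTE_DONE in line:
            return True
    return False
```

Fed the first log, this returns True; fed the second, where the
shutdown request arrives before promotion finishes, it returns False.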
The attached script requires that sudo can be executed. There is one
more point to note. The script attempts to establish a replication
connection to $primary_address:$primary_port. For the packet filter to
work, this must be a remote address that is reachable when no
packet-filter rule is in place. The firewall-cmd settings need to be
configured to block this connection. If an unreachable IP address is
simply set instead, the connection attempt will fail immediately with
a "No route to host" error before the first packet is sent out, and it
will not be blocked as intended.
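To check beforehand whether a candidate address behaves as required, a
small probe along the following lines may help. This is only a sketch
in Python and not part of the attached script; the function name
probe_connect and the 3-second default timeout are my assumptions.

```python
import errno
import socket


def probe_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt.

    Returns 'connected', 'timeout', or 'failed: <errno name>'.
    Hypothetical helper for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        # Immediate failures such as EHOSTUNREACH or ECONNREFUSED.
        name = errno.errorcode.get(exc.errno, "unknown") if exc.errno else "unknown"
        return "failed: " + name
```

Under these assumptions, a suitable address should give "connected"
without the packet filter and "timeout" with it. An address that
yields an immediate "failed: ..." result (for example EHOSTUNREACH) is
the "No route to host" case and is not usable for the repro.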
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
promote_test.pl | text/plain | 1.4 KB |