From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: 001_rep_changes.pl stalls |
Date: | 2020-04-14 01:38:49 |
Message-ID: | 20200414013849.GA886648@rfd.leadboat.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sun, Apr 05, 2020 at 11:36:49PM -0700, Noah Misch wrote:
> Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is
> important for promptly updating pg_stat_replication. When caught up, we
> should impose that logic before every sleep. The one-line fix is to sleep in
> WalSndLoop() only when pq_is_send_pending(), not when caught up.
This seems to have made the following race condition easier to hit:
https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
Now it happened eight times in three days, all on BSD machines:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2018%3A30%3A21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2018%3A45%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2020%3A30%3A26
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2021%3A45%3A48
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2010%3A45%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-04-13%2016%3A00%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2018%3A45%3A34
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2023%3A45%3A22
While I don't think that indicates anything wrong with the fix for $SUBJECT,
creating more buildfarm noise is itself bad. I am inclined to revert the fix
after a week. Not immediately, in case it uncovers lower-probability bugs.
I'd then re-commit it after one of those threads fixes the other bug. Would
anyone like to argue for a revert earlier, later, or not at all?
There was a novel buildfarm failure, in 010_logical_decoding_timelines.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2008%3A35%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2017%3A15%3A01
Most-relevant lines of the test script:
$node_master->safe_psql('postgres',
"INSERT INTO decoding(blah) VALUES ('afterbb');");
$node_master->safe_psql('postgres', 'CHECKPOINT');
$node_master->stop('immediate');
The failure suggested the INSERT was not replicated before the immediate stop.
I can reproduce that consistently, before or after the fix for $SUBJECT, by
modifying walsender to delay 0.2s before sending WAL:
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
#include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
#include "miscadmin.h"
@@ -2781,2 +2782,5 @@ retry:
+ PG_SETMASK(&BlockSig);
+ pg_usleep(200 * 1000);
+ PG_SETMASK(&UnBlockSig);
pq_putmessage_noblock('d', output_message.data, output_message.len);
I will shortly push a fix adding a wait_for_catchup to the test. I don't know
if/how fixing $SUBJECT made this 010_logical_decoding_timelines.pl race
condition easier to hit.
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2020-04-14 01:45:16 | Re: 001_rep_changes.pl stalls |
Previous Message | Tom Lane | 2020-04-14 01:34:20 | Re: Race condition in SyncRepGetSyncStandbysPriority |