Quick Links

Re: Race conditions with checkpointer and shutdown

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Race conditions with checkpointer and shutdown
Date:	2019-06-13 06:01:23
Message-ID:	20190613060123.GC1643@paquier.xyz
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote:
> I looked at the buildfarm failures for the recoveryCheck stage. It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11
> AFAICS this is wholly unrelated to the problem at hand.

Yes, that's unrelated.

This failure is interesting, still it has happened only once per what
I can see. I think that this points out a rare race condition in test
007_sync_rep.pl because of this sequence which reorders the standbys:
# Stop and start standbys to rearrange the order of standbys
# in WalSnd array. Now, if standbys have the same priority,
# standby2 is selected preferentially and standby3 is next.
$node_standby_1->stop;
$node_standby_2->stop;
$node_standby_3->stop;

$node_standby_2->start;
$node_standby_3->start;

The failure actually indicates that even if standby3 has started
after standby2, standby3 has registered back into the WAL sender array
of the primary before standby2 had the occasion to do it, but we have
no guarantee that the ordering is actually right. So this has messed
up with the expected set of sync standbys when these are not directly
listed. I think that this can happen as well when initializing the
standbys at the top of the test, but the window is much, much
narrower and basically impossible to reach.

Using a combo of safe_psql('checkpoint') + wait_for_catchup() makes
the test faster, but that's much more costly than using just
poll_query_until() on pg_stat_replication to make sure that each
standby is registered on the primary's WAL sender array. In short,
something like the attached should improve the stability of the test.

Thoughts?
--
Michael

Attachment	Content-Type	Size
syncrep-stability.patch	text/x-diff	2.7 KB

In response to

Re: Race conditions with checkpointer and shutdown at 2019-06-12 17:42:01 from Alvaro Herrera

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2019-06-13 06:11:15	Re: [PATCH] vacuumlo: print the number of large objects going to be removed
Previous Message	Tsunakawa, Takayuki	2019-06-13 05:57:50	RE: [PATCH] Speedup truncates of relation forks