Re: Race condition in SyncRepGetSyncStandbysPriority

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority
Date: 2020-04-11 22:30:30
Message-ID: 5637.1586644230@sss.pgh.pa.us
Lists: pgsql-hackers

Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> writes:
> On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>>> Therefore, the band-aid fix seems to be to set the lowest priority to
>>> a very large number at the beginning of SyncRepGetSyncStandbysPriority().

>> I think we can use max_wal_senders.

> Sorry, that's not true. We need another number that is large enough.

The buildfarm had another three failures of this type today, so that
motivated me to look at it some more. I don't think this code needs
a band-aid fix; I think "nuke from orbit" is more nearly the right
level of response.

The point that I was trying to make originally is that it seems quite
insane to imagine that a walsender's sync_standby_priority value is
somehow more stable than the very existence of the process. Yet we
only require a walsender to lock its own mutex while claiming or
disowning its WalSnd entry (by setting or clearing the pid field).
So I think it's nuts to define those fields as being protected by
the global SyncRepLock.
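
For reference, the relevant parts of the WalSnd struct look roughly
like this (condensed from src/include/replication/walsender_private.h,
with most fields elided):

typedef struct WalSnd
{
    pid_t       pid;        /* this walsender's PID, or 0 if not active */

    /*
     * WAL positions written/flushed/applied on the standby side; these
     * may be invalid if the standby hasn't reported values yet.
     */
    XLogRecPtr  write;
    XLogRecPtr  flush;
    XLogRecPtr  apply;

    /*
     * Priority of this standby per synchronous_standby_names, or 0 if
     * not listed.  Per the comments, protected by the global
     * SyncRepLock, not by this entry's own mutex.
     */
    int         sync_standby_priority;

    slock_t     mutex;      /* protects the shared variables above */
} WalSnd;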

Even without considering the possibility that a walsender has just
started or stopped, we have the problem Fujii-san described that after
a change in the synchronous_standby_names GUC setting, different
walsenders will update their values of sync_standby_priority at
different instants. (BTW, I now notice that Noah had previously
identified this problem at [1].)

Thus, even while holding SyncRepLock, we do not have a guarantee that
we'll see a consistent set of sync_standby_priority values. In fact
we don't even know that the walsnd array entries still belong to the
processes that last set those values. This is what is breaking
SyncRepGetSyncStandbysPriority, and what it means is that there's
really fundamentally no chance of that function producing trustworthy
results. The "band aid" fixes discussed here might avoid crashing on
the Assert, but they won't fix the problems that (a) the result is
possibly wrong and (b) it can become stale immediately even if it's
right when returned.

Now, there are only two callers of SyncRepGetSyncStandbys:
SyncRepGetSyncRecPtr and pg_stat_get_wal_senders. The latter is
mostly cosmetic (which is a good thing, because to add insult to
injury, it continues to use the list after releasing SyncRepLock;
not that continuing to hold that lock would make things much safer).
If I'm reading the code correctly, the former doesn't really care
exactly which walsenders are sync standbys: all it cares about is
to collect their WAL position pointers.

What I think we should do about this is, essentially, to get rid of
SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
whether *it* believes that it is a sync standby, based on its last
evaluation of the relevant GUCs. This would be a bool that it'd
compute and set alongside sync_standby_priority. (Hm, maybe we'd not
even need to have that field anymore? Not sure.) We should also
redefine that flag, and sync_standby_priority if it survives, as being
protected by the per-walsender mutex not SyncRepLock. Then, what
SyncRepGetSyncRecPtr would do is just sweep through the walsender
array and collect WAL position pointers from the walsenders that
claim to be sync standbys at the instant that it's inspecting them.
pg_stat_get_wal_senders could also use those flags instead of the
list from SyncRepGetSyncStandbys.
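
For concreteness, the collection step could look about like this
(an untested sketch patterned on the existing
SyncRepGetOldestSyncRecPtr; "is_sync_standby" is a made-up field
name, and I've only shown the oldest-position variant):

static void
SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
                           XLogRecPtr *applyPtr)
{
    int         i;

    for (i = 0; i < max_wal_senders; i++)
    {
        WalSnd     *walsnd = &WalSndCtl->walsnds[i];
        bool        is_sync;
        XLogRecPtr  write;
        XLogRecPtr  flush;
        XLogRecPtr  apply;

        /* read everything we need in one trip under the entry's mutex */
        SpinLockAcquire(&walsnd->mutex);
        is_sync = (walsnd->pid != 0 && walsnd->is_sync_standby);
        write = walsnd->write;
        flush = walsnd->flush;
        apply = walsnd->apply;
        SpinLockRelease(&walsnd->mutex);

        if (!is_sync)
            continue;

        /* accumulate the oldest position reported by any sync standby */
        if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
            *writePtr = write;
        if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
            *flushPtr = flush;
        if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
            *applyPtr = apply;
    }
}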

It's likely that this definition would have slightly different
behavior from the current implementation during the period where
the system is absorbing a change in the set of synchronous
walsenders. However, since the current implementation is visibly
wrong during that period anyway, I'm not sure how this could be
worse. And at least we can be certain that SyncRepGetSyncRecPtr
will not include WAL positions from already-dead walsenders in
its calculations, which *is* a hazard in the code as it stands.

I also estimate that this would be noticeably more efficient than
the current code, since the logic to decide who's a sync standby
would only run when we're dealing with walsender start/stop or
SIGHUP, rather than every time SyncRepGetSyncRecPtr runs.
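
Concretely, the recomputation would live about where
SyncRepInitConfig() sets sync_standby_priority today. An untested
sketch, where SyncRepDecideAmSync() stands in for whatever evaluation
of the GUCs turns out to be possible locally:

void
SyncRepInitConfig(void)
{
    int         priority;
    bool        am_sync;

    priority = SyncRepGetStandbyPriority();     /* existing helper */

    /*
     * Hypothetical helper: decide from the GUCs whether this walsender
     * believes it is a sync standby.  Pinning down exactly what can be
     * decided locally is the interesting part of the project.
     */
    am_sync = SyncRepDecideAmSync(priority);

    /* advertise both values under our own mutex, not SyncRepLock */
    SpinLockAcquire(&MyWalSnd->mutex);
    MyWalSnd->sync_standby_priority = priority;
    MyWalSnd->is_sync_standby = am_sync;        /* made-up field */
    SpinLockRelease(&MyWalSnd->mutex);
}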

Don't especially want to code this myself, though. Anyone?

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
