From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | noah(at)leadboat(dot)com |
Cc: | tgl(at)sss(dot)pgh(dot)pa(dot)us, masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: 001_rep_changes.pl stalls |
Date: | 2020-04-17 08:06:29 |
Message-ID: | 20200417.170629.1231777174468368420.horikyota.ntt@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Sorry , I wrote something wrong.
At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> > > Noah Misch <noah(at)leadboat(dot)com> writes:
> > > > This seems to have made the following race condition easier to hit:
> > > > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> > > > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
> > >
> > > Yeah, I just came to the same guess in the other thread.
> > >
> > > > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > > > creating more buildfarm noise is itself bad. I am inclined to revert the fix
> > > > after a week. Not immediately, in case it uncovers lower-probability bugs.
> > > > I'd then re-commit it after one of those threads fixes the other bug. Would
> > > > anyone like to argue for a revert earlier, later, or not at all?
> > >
> > > I don't think you should revert. Those failures are (just) often enough
> > > to be annoying but I do not think that a proper fix is very far away.
> >
> > That works for me, but an actual defect may trigger a revert. Fujii Masao
> > reported high walsender CPU usage after this patch. The patch caused idle
> > physical walsenders to use 100% CPU. When caught up, the
> > WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
> > until more WAL is available. XLogSendPhysical() just returns when caught up.
> > No amount of WAL is too small for physical replication to dispatch, but
> > logical replication needs the full xl_tot_len of a record. Some options:
> >
> > 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
> > WalSndWaitForWal() when no WAL is available. A quick version of this
> > passes tests, but I'll need to audit WalSndWaitForWal() for things that are
> > wrong for physical replication.
> >
> > 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
> > insufficient WAL is available. This complicates the xlogreader.h API to
> > pass back "wait for this XLogRecPtr", and we'd then persist enough state to
> > resume decoding. This doesn't have any advantages to make up for those.
> >
> > 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
> > WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
> > further drift between the two wait sites; on the other hand, one could
> > refactor later to help avoid that.
> >
> > 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
> > minimal fix, but it crudely punches through the abstraction between
> > WalSndLoop() and its WalSndSendDataCallback.
> >
> > I'm favoring (1). Other preferences?
>
> Starting from the current shape, I think 1 is preferable, since that
> waiting logic is no longer shared between logical and physical
> replications. But I'm not sure I like calling WalSndWaitForWal()
> (maybe with previous location + 1?), because the function seems to do
> too-much.
>
> By the way, if latch is consumed in WalSndLoop, succeeding call to
> WalSndWaitForWal cannot be woke-up by the latch-set. Doesn't that
> cause missing wakeups? (in other words, overlooking of wakeup latch).
- Since the only source other than timeout of walsender wakeup is latch,
- we should avoid wasteful consuming of latch. (It is the same issue
- with [1]).
+ Since walsender is wokeup by LSN advancement via latch, we should
+ avoid wasteful consuming of latch. (It is the same issue with [1]).
> If wakeup signal is not remembered on walsender (like
> InterruptPending), WalSndPhysical cannot enter a sleep with
> confidence.
>
>
> [1] https://www.postgresql.org/message-id/20200408.164605.1874250940847340108.horikyota.ntt@gmail.com
--
Kyotaro Horiguchi
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | Kyotaro Horiguchi | 2020-04-17 08:07:06 | Re: Race condition in SyncRepGetSyncStandbysPriority |
Previous Message | Masahiko Sawada | 2020-04-17 08:03:11 | Re: Race condition in SyncRepGetSyncStandbysPriority |