Re: Unnecessary delay in streaming replication due to replay lag

From: Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "lchch1990(at)sina(dot)cn" <lchch1990(at)sina(dot)cn>, Asim Praveen <pasim(at)vmware(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, "Hao Wu (Pivotal)" <hawu(at)pivotal(dot)io>, "ahsan(dot)hadi" <ahsan(dot)hadi(at)highgo(dot)ca>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Ashwin Agrawal <ashwinstar(at)gmail(dot)com>
Subject: Re: Unnecessary delay in streaming replication due to replay lag
Date: 2021-11-09 23:41:09
Message-ID: CAE-ML+-8KnuJqXKHz0mrC7-qFMQJ3ArDC78X3-AjGKos7Ceocw@mail.gmail.com
Lists: pgsql-hackers

Hi Michael,

Thanks for the detailed review! Attached is a rebased patch that addresses
most of the feedback.

On Mon, Nov 8, 2021 at 1:41 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> +static void
> +StartWALReceiverEagerly()
> +{
> The patch fails to apply because of the recent changes from Robert to
> eliminate ThisTimeLineID. The correct thing to do would be to add one
> TimeLineID argument, passing down the local ThisTimeLineID in
> StartupXLOG() and using XLogCtl->lastReplayedTLI in
> CheckRecoveryConsistency().

Rebased.

> + /*
> + * We should never reach here. We should have at least one valid WAL
> + * segment in our pg_wal, for the standby to have started.
> + */
> + Assert(false);
> The reason behind that is not that we have a standby, but that we read
> at least the segment that included the checkpoint record we are
> replaying from, at least (it is possible for a standby to start
> without any contents in pg_wal/ as long as recovery is configured),
> and because StartWALReceiverEagerly() is called just after that.

Fair, comment updated.

> It would be better to make sure that StartWALReceiverEagerly() gets
> only called from the startup process, perhaps?

Added Assert(AmStartupProcess()) at the beginning of
StartWALReceiverEagerly().
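
Roughly, the guard looks like this (a sketch only, with the TimeLineID
argument from the rebase above; the scan itself is elided):

    static void
    StartWALReceiverEagerly(TimeLineID tli)
    {
        /* The eager start must only ever run in the startup process. */
        Assert(AmStartupProcess());

        /* ... scan pg_wal and request streaming from the first valid record ... */
    }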

>
> + RequestXLogStreaming(ThisTimeLineID, startptr, PrimaryConnInfo,
> + PrimarySlotName, wal_receiver_create_temp_slot);
> + XLogReaderFree(state);
> XLogReaderFree() should happen before RequestXLogStreaming(). The
> tipping point of the patch is here, where the WAL receiver is started
> based on the location of the first valid WAL record found.

Done.

> wal_receiver_start_condition is missing in postgresql.conf.sample.

Fixed.
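
For reference, the sample-file entry is a single commented-out line along
these lines (the exact wording and value names are whatever the patch uses;
'exhaust' is the current behaviour, as discussed below):

    #wal_receiver_start_condition = 'exhaust'	# startup, consistency, or exhaust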

> + /*
> + * Start WAL receiver eagerly if requested.
> + */
> + if (StandbyModeRequested && !WalRcvStreaming() &&
> + PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
> + wal_receiver_start_condition == WAL_RCV_START_AT_STARTUP)
> + StartWALReceiverEagerly();
> [...]
> + if (StandbyModeRequested && !WalRcvStreaming() && reachedConsistency &&
> + PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
> + wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY)
> + StartWALReceiverEagerly();
> This repeats two times the same set of conditions, which does not look
> like a good idea to me. I think that you'd better add an extra
> argument to StartWALReceiverEagerly to track the start timing expected
> in this code path, that will be matched with the GUC in the routine.
> It would be better to document the reasons behind each check done, as
> well.

Done.
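
The consolidated shape is roughly the following (a sketch only, not the
literal patch text; the helper name is a placeholder, and the globals are
the ones from the hunks above):

    static void
    MaybeStartWALReceiverEagerly(int start_condition, TimeLineID tli)
    {
        /* Only act for the timing the GUC asks for. */
        if (wal_receiver_start_condition != start_condition)
            return;

        /*
         * Standby mode, a primary connection string, and no already-running
         * WAL receiver are prerequisites for an eager start.
         */
        if (!StandbyModeRequested || WalRcvStreaming() ||
            !PrimaryConnInfo || strcmp(PrimaryConnInfo, "") == 0)
            return;

        /* The 'consistency' timing additionally requires a consistent state. */
        if (start_condition == WAL_RCV_START_AT_CONSISTENCY &&
            !reachedConsistency)
            return;

        StartWALReceiverEagerly(tli);
    }

with the two call sites passing the timing and timeline explicitly:

    /* in StartupXLOG() */
    MaybeStartWALReceiverEagerly(WAL_RCV_START_AT_STARTUP, ThisTimeLineID);
    /* in CheckRecoveryConsistency() */
    MaybeStartWALReceiverEagerly(WAL_RCV_START_AT_CONSISTENCY,
                                 XLogCtl->lastReplayedTLI);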

> So, this reads the contents of pg_wal/ for any files that exist, then
> goes down to the first segment found with a valid beginning. That's
> going to be expensive with a large max_wal_size. When searching for a
> point like that, a dichotomy method would be better to calculate a LSN
> you'd like to start from.

Even with a large max_wal_size, do we expect a lot of invalid high-numbered
WAL files? If not, most of the time we would only need to look at the last
one or two WAL files to determine the start point, which should keep the
scan cheap.

> Anyway, I think that there is a problem
> with the approach: what should we do if there are holes in the
> segments present in pg_wal/? As of HEAD, or
> wal_receiver_start_condition = 'exhaust' in this patch, we would
> switch across local pg_wal/, archive and stream in a linear way,
> thanks to WaitForWALToBecomeAvailable(). For example, imagine that we
> have a standby with the following set of valid segments, because of
> the buggy way a base backup has been taken:
> 000000010000000000000001
> 000000010000000000000003
> 000000010000000000000005
> What the patch would do is starting a WAL receiver from segment 5,
> which is in contradiction with the existing logic where we should try
> to look for the segment once we are waiting for something in segment
> 2. This would be dangerous once the startup process waits for some
> WAL to become available, because we have a WAL receiver started, but
> we cannot fetch the segment we have. Perhaps a deployment has
> archiving, in which case it would be able to grab segment 2 (if no
> archiving, recovery would not be able to move on, so that would be
> game over).

We could easily check for holes while we are doing the ReadDir() and
bail from the early start if there are holes, just like we do if there
is a timeline jump in any of the WAL segments.
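
For example, something along these lines during the scan (a sketch only;
it tracks the minimum/maximum segment numbers so the unsorted order from
ReadDir() does not matter, and it ignores timelines for brevity):

    DIR        *dir = AllocateDir(XLOGDIR);
    struct dirent *de;
    XLogSegNo   min_segno = 0;
    XLogSegNo   max_segno = 0;
    uint64      nsegments = 0;

    while ((de = ReadDir(dir, XLOGDIR)) != NULL)
    {
        TimeLineID  tli;
        XLogSegNo   segno;

        if (!IsXLogFileName(de->d_name))
            continue;
        XLogFromFileName(de->d_name, &tli, &segno, wal_segment_size);

        if (nsegments == 0 || segno < min_segno)
            min_segno = segno;
        if (nsegments == 0 || segno > max_segno)
            max_segno = segno;
        nsegments++;
    }
    FreeDir(dir);

    /* Contiguous iff the count matches the span; otherwise bail. */
    if (nsegments == 0 || nsegments != max_segno - min_segno + 1)
        return;     /* hole detected: fall back to the 'exhaust' behaviour */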

> /*
> * Move to XLOG_FROM_STREAM state, and set to start a
> - * walreceiver if necessary.
> + * walreceiver if necessary. The WAL receiver may have
> + * already started (if it was configured to start
> + * eagerly).
> */
> currentSource = XLOG_FROM_STREAM;
> - startWalReceiver = true;
> + startWalReceiver = !WalRcvStreaming();
> break;
> case XLOG_FROM_ARCHIVE:
> case XLOG_FROM_PG_WAL:
>
> - /*
> - * WAL receiver must not be running when reading WAL from
> - * archive or pg_wal.
> - */
> - Assert(!WalRcvStreaming());
>
> These parts should IMO not be changed. They are strong assumptions we
> rely on in the startup process, and this comes down to the fact that
> it is not a good idea to mix a WAL receiver started while
> currentSource could be pointing at a WAL source completely different.
> That's going to bring a lot of racy conditions, I am afraid, as we
> rely on currentSource a lot during recovery, in combination that we
> expect the code to be able to retrieve WAL in a linear fashion from
> the LSN position that recovery is looking for.
>
> So, I think that deciding if a WAL receiver should be started blindly
> outside of the code path deciding if the startup process is waiting
> for some WAL is not a good idea, and the position we may begin to
> stream from may be something that we may have zero need for at the
> end (this is going to be tricky if we detect a TLI jump while
> replaying the local WAL, also?). The issue is that I am not sure what
> a good design for that should be. We have no idea when the startup
> process will need WAL from a different source until replay comes
> around, but what we want here is to anticipate this LSN :)

Can you elaborate on the race conditions that you are thinking about?
Do the race conditions manifest only when we mix archiving and streaming?
If yes, how do you feel about making the GUC a no-op with a WARNING
while we are in WAL archiving mode?
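
As a strawman, the no-op could sit at the top of the eager-start path,
something like this (a sketch only; detecting "archiving mode" via
restore_command / recoveryRestoreCommand is my assumption):

    /*
     * Hypothetical: ignore the eager-start request when archive recovery
     * is configured, since mixing the two sources is what causes the
     * races discussed above.
     */
    if (recoveryRestoreCommand != NULL && recoveryRestoreCommand[0] != '\0')
    {
        ereport(WARNING,
                (errmsg("wal_receiver_start_condition ignored because restore_command is set")));
        return;
    }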

> I am wondering if there should be a way to work out something with the
> control file, though, but things can get very fancy with HA
> and base backup deployments and the various cases we support thanks to
> the current way recovery works, as well. We could also go simpler and
> rework the priority order if both archiving and streaming are options
> wanted by the user.

Agreed, it would be much better to depend on the state in pg_wal,
namely the files that are available there.

Reworking the priority order seems like an appealing fix - if we can say
streaming > archiving in terms of priority, then the race that you are
referring to will not happen?

Also, what are some use cases where one would give priority to streaming
replication over archive recovery, if both sources have the same WAL
segments?

Regards,
Ashwin & Soumyadeep

Attachment Content-Type Size
v3-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patch text/x-patch 17.9 KB
