From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | masao(dot)fujii(at)oss(dot)nttdata(dot)com |
Cc: | andres(at)anarazel(dot)de, alvherre(at)alvh(dot)no-ip(dot)org, pgsql-hackers(at)lists(dot)postgresql(dot)org, bossartn(at)amazon(dot)com, mengjuan(dot)cmj(at)alibaba-inc(dot)com, Jakub(dot)Wartak(at)tomtom(dot)com |
Subject: | Re: prevent immature WAL streaming |
Date: | 2021-09-02 00:24:15 |
Message-ID: | 20210902.092415.728659185237934198.horikyota.ntt@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
At Wed, 1 Sep 2021 15:01:43 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
>
>
> On 2021/09/01 12:15, Andres Freund wrote:
> > Hi,
> > On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> >> On 2021/09/01 0:53, Andres Freund wrote:
> >>> Of course, we need to be careful to not weaken WAL validity checking
> >>> too
> >>> much. How about the following:
> >>>
> >>> If we're "aborting" a continued record, we set
> >>> XLP_FIRST_IS_ABORTED_PARTIAL on
> >>> the page at which we do so (i.e. the page after the valid end of the
> >>> WAL).
> >>
> >> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set
> >> when recovery finds a a partially-flushed segment-spanning record?
> >> But maybe we cannot do that (i.e., cannot overwrite the page) because
> >> the page that the flag is set in might have already been archived. No?
> > I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the
> > "tail
> > end" of a partial record. I.e. if there's a partial record starting in
> > the
> > successfully archived segment A, but the end of the record, in B, has
> > not been
> > written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL
> > at the
> > end of the valid data in B. Which could not have been archived yet, or
> > we'd
> > not have a partial record. So we should never need to set the flag on
> > an
> > already archived page.
>
> Thanks for clarifying that! Unless I misunderstand that, when recovery
> ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting
> writing
> new WAL data there. Therefore XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the next segment following
> a partially-flushed segment-spanning record. If none of them is set,
> the validation code in recovery should report an error.
>
> Yes, this design looks good to me.
So, this is a crude PoC of that.
At the end of recovery:
- When XLogReadRecord misses a page where the next part of the current
continuation record should be seen, xlogreader->ContRecAbortPtr is
set to the beginning of the missing page.
- When StartupXLOG receives a valid ContRecAbortPtr, the value is used
as the next WAL insertion location then sets the same to
XLogCtl->contAbortedRecPtr.
- When XLogCtl->contAbortedRecPtr is set, AdvanceXLInsertBuffer()
(called under XLogInsertRecord()) sets XLP_FIRST_IS_ABORTED_PARTIAL
flag to the page.
While recovery:
- When XLogReadRecord meets a XLP_FIRST_IS_ABORT_PARTIAL page, it
rereads a record from there.
In this PoC,
1. This patch is written on the current master, but it doesn't
interfare with the seg-boundary-memorize patch since it removes the
calls to RegisterSegmentBoundary.
2. Since xlogreader cannot emit a log-message immediately, we don't
have a means to leave a log message to inform recovery met an
aborted partial continuation record. (In this PoC, it is done by
fprintf:p)
3. Myebe we need to pg_waldump to show partial continuation records,
but I'm not sure how to realize that.
4. By the way, is this (method) applicable in this stage?
The attached first is the PoC including debug-crash aid. The second
is the repro script. It failes to reproduce the situation once per
several trials.
The following log messages are shown by a run of the script.
> ...
> TRAP: FailedAssertion("c++ < 1", File: "xlog.c", Line: 2675, PID: 254644)
> ...
> LOG: database system is shut down
> ...
>
> LOG: redo starts at 0/2000028
> LOG: redo done at 0/6FFFFA8 system usage:...
> LOG: #### Recovery finished: ContRecAbort: 0/7000000 (EndRecPtr: 0/6FFFFE8)
The record from 6FFFFE8 is missing the trailing part after 7000000.
> LOG: #### EndOfLog=0/7000000
> LOG: #### set XLP_FIRST_IS_ABORT_PARTIAL(at)0/7000000
So, WAL insertion starts from 7000000 and the first page is set the flag.
> LOG: database system is ready to accept connections
> ...
> LOG: database system is shut down
> ...
> #########################
> ...
> LOG: redo starts at 0/2000028
> LOG: consistent recovery state reached at 0/2000100
> ...
> LOG: restored log file "000000010000000000000007" from archive
> #### aborted partial continuation record found at 0/6FFFFE8, continue from 0/7000000
The record from 6FFFFE8 is immature so skip it and continue reading
from 7000000.
> LOG: last completed transaction was at log time 2021-09-01 20:40:21.775295+09
> LOG: #### Recovery finished: ContRecAbort: 0/0 (EndRecPtr: 0/8000000)
> LOG: restored log file "000000010000000000000007" from archive
> LOG: selected new timeline ID: 2
> LOG: archive recovery complete
> LOG: #### EndOfLog=0/8000000
Recovery ends.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | Kyotaro Horiguchi | 2021-09-02 01:16:40 | Re: Possible missing segments in archiving on standby |
Previous Message | Jacob Champion | 2021-09-02 00:09:49 | Re: [PATCH] test/ssl: rework the sslfiles Makefile target |