From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | david(at)pgmasters(dot)net |
Cc: | peter(dot)eisentraut(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, pgsql-hackers(at)lists(dot)postgresql(dot)org, jtc331(at)gmail(dot)com, robertmhaas(at)gmail(dot)com |
Subject: | Re: Make mesage at end-of-recovery less scary. |
Date: | 2021-03-04 06:50:39 |
Message-ID: | 20210304.155039.2020434514871976864.horikyota.ntt@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
At Wed, 3 Mar 2021 11:14:20 -0500, David Steele <david(at)pgmasters(dot)net> wrote in
> Hi Kyotaro,
>
> On 3/27/20 10:25 PM, James Coleman wrote:
> > On Thu, Mar 26, 2020 at 12:41 PM Robert Haas <robertmhaas(at)gmail(dot)com>
> > wrote:
> >>
> >> I'm just spitballing here, but it would be really good if there's a
> >> way to know definitely whether or not you should be scared. Corrupted
> >> WAL segments are definitely a thing that happens, but retries are a
> >> lot more common.
> > First, I agree that getting enough context to say precisely is by far
> > the ideal.
> > That being said, as an end user who's found this surprising -- and
> > momentarily scary every time I initially scan it even though I *know
> > intellectually it's not* -- I would find Peter's suggestion a
> > significant improvement over what we have now. I'm fairly certainly my
> > co-workers on our database team would also. Knowing that something is
> > at least not always scary is good. Though I'll grant that this does
> > have the negative in reverse: if it actually is a scary
> > situation...this mutes your concern level. On the other hand,
> > monitoring would tell us if there's a real problem (namely replication
> > lag), so I think the trade-off is clearly worth it.
> > How about this minor tweak:
> > HINT: This is expected if this is the end of currently available WAL.
> > Otherwise, it could indicate corruption.
>
> Any thoughts on the suggestions for making the messaging clearer?
>
> Also, the patch no longer applies:
> http://cfbot.cputube.org/patch_32_2490.log.
Sorry for missing the last discussions. I agree to the point about
really-scary situation.
ValidXLogRecordHeader deliberately marks End-Of-WAL only in the case
of zero-length record so that the callers can identify that case,
instead of inferring the EOW state without it. All other invalid data
is treated as potentially danger situation. I think this is a
reasonable classification. And the error level for the "danger" cases
is changed to WARNING (from LOG).
As the result, the following messages are emitted with the attached.
- found zero-length record during recovery (the DETAIL might not be needed.)
> LOG: redo starts at 0/14000118
> LOG: reached end of WAL at 0/14C5D070 on timeline 1 in pg_wal during crash recovery
> DETAIL: record length is 0 at 0/14C5D070
> LOG: redo done at 0/14C5CF48 system usage: ...
- found another kind of invalid data
> LOG: redo starts at 0/150000A0
> WARNING: invalid record length at 0/1500CA60: wanted 24, got 54
> LOG: redo done at 0/1500CA28 system usage: ...
On the way checking the patch, I found that it emits the following log
lines in the case the redo loop meets an invalid record at the
starting:
> LOG: invalid record length at 0/10000118: wanted 24, got 42
> LOG: redo is not required
which doesn't look proper. That case is identifiable using the
End-of_WAL flag this patch adds. Thus we get the following error
messages.
- found end-of-wal at the beginning of recovery
> LOG: reached end of WAL at 0/130000A0 on timeline 1 in pg_wal during crash recovery
> DETAIL: record length is 0 at 0/130000A0
> LOG: redo is not required
- found invalid data
> WARNING: invalid record length at 0/120000A0: wanted 24, got 42
> WARNING: redo is skipped
> HINT: This suggests WAL file corruption. You might need to check the database.
The logic of ErrRecPtr in ReadRecord may wrong. I remember having
such an discussion before...
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
v4-0001-Make-End-Of-Recovery-error-less-scary.patch | text/x-patch | 8.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Bharath Rupireddy | 2021-03-04 06:51:40 | Re: Improvements and additions to COPY progress reporting |
Previous Message | Bharath Rupireddy | 2021-03-04 06:46:17 | Re: Improvements and additions to COPY progress reporting |