Re: Wall shiping replica failed to recover database with error: invalid contrecord length 1956 at FED/38FFE208

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: sfrost(at)snowman(dot)net
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, zeleny(dot)ales(at)gmail(dot)com, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Wall shiping replica failed to recover database with error: invalid contrecord length 1956 at FED/38FFE208
Date: 2019-11-14 05:28:03
Message-ID: 20191114.142803.1155800695686449695.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello.

At Wed, 2 Oct 2019 19:24:02 -0400, Stephen Frost <sfrost(at)snowman(dot)net> wrote in
> Greetings,
>
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> > Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > > * Aleš Zelený (zeleny(dot)ales(at)gmail(dot)com) wrote:
> > >> But recovery on replica failed to proceed WAL file
> > >> 0000000100000FED00000039 with log message: " invalid contrecord length
> > >> 1956 at FED/38FFE208".
> >
> > > Err- you've drawn the wrong conclusion from that message (and you're
> > > certainly not alone- it's a terrible message and we should really have a
> > > HINT there or something).
> >
> > Yeah, those messages are all pretty ancient, from when WAL was new and not
> > to be trusted much. Perhaps the thing to do is move the existing info
> > into DETAIL and make the primary message be something like "reached
> > apparent end of WAL stream".
>
> Yes, +1 on that.

The attached is something like that. AFAICS we can assume that an
invalid record means the end of WAL as long as we are fetching
successive records and XLogPageRead() has not logged anything.

As the current comment in ReadRecord says, we don't have a message if
standby has been triggered, but that's not always the case. Still we
may have a message if new segment is available after triggering.

I used a kind-of-tricky way to handle optional errdetail but it may be
better to have some additional feature as an ereport
subfunction. (Maybe named "errif" or something?)

I think it works fine as expected but I find one arguable behavior.

In the first patch, the "end of WAL" message is accompanied by source,
LSN and timeline.

LOG: reached end of WAL in archive at 0/A3000060 on timeline 8
DETAIL: invalid record length at 0/A3000060: wanted 24, got 0

The last two just seems good but the first one shows "archive" even on
promotion. It is right that we scan archive after promotion but seems
confusing. So I attached another patch that shows operation instead of
WAL source.

LOG: reached end of WAL during streaming reaplication at 0/A5000060 on timeline 8
DETAIL: invalid record length at 0/A5000060: wanted 24, got 0

What do you think about this?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
log_end_of_wal.patch text/x-patch 3.6 KB
log_end_of_wal_2.patch text/x-patch 3.6 KB

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Abhijit Gharami 2019-11-14 05:38:18 root page 3 of index "pg_class_oid_index" has level 0, expected 1
Previous Message Gerrit Fouche 2019-11-14 05:18:18 Re: terminated by signal 11: Segmentation fault