Re: Wall shiping replica failed to recover database with error: invalid contrecord length 1956 at FED/38FFE208

From: Aleš Zelený <zeleny(dot)ales(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Wall shiping replica failed to recover database with error: invalid contrecord length 1956 at FED/38FFE208
Date: 2019-10-03 07:44:52
Message-ID: CAODqTUZMNQ223Dtr9zJpcMSvZRLRo8qcj2OLbc0_1yFAdZsGGQ@mail.gmail.com
Lists: pgsql-general

Hello,

On Thu, 3 Oct 2019 at 0:09, Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> Greetings,
>
> * Aleš Zelený (zeleny(dot)ales(at)gmail(dot)com) wrote:
> > But recovery on replica failed to proceed WAL file
> > 0000000100000FED00000039 with log message: " invalid contrecord length
> > 1956 at FED/38FFE208".
>
> Err- you've drawn the wrong conclusion from that message (and you're
> certainly not alone- it's a terrible message and we should really have a
> HINT there or something). That's an INFO-level message, not an error,
> and basically just means "oh, look, there's an invalid WAL record, guess
> we got to the end of the WAL available from this source." If you had
> had primary_conninfo configured in your recovery.conf, PG would likely
> have connected to the primary and started replication. One other point
> is that if you actually did a promotion in this process somewhere, then
> you might want to set recovery_target_timeline=latest, to make sure the
> replica follows along on the timeline switch that happens when a
> promotion happens.
>

Thanks for this explanation (and thanks to the others for their comments as
well). I failed to describe our case properly. We are recovering the replica
instance from WALs only, since the replica is in a separate network (used as
a source for ZFS cloning for development), so there is no primary_conninfo
(the replica can't influence the primary instance in any way, it just
consumes the primary instance's WALs) and we did not perform a replica
promotion.

I'd guess that with an out-of-disk-space issue the last WAL might be
incomplete, but its size was the expected 16777216 bytes on the primary
instance's disk and it was binary identical to the file restored on the
replica from backup. The issue was that the replica emitted this INFO
message, but it was not able to move on to the next WAL file and started
falling behind the primary instance.
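
As a side note, here is a minimal sketch (Python) of how the LSN from the
message maps to a WAL segment file name. It assumes the default 16 MB segment
size and timeline 1 (taken from the first eight hex digits of the file name
above); the function names are made up for illustration and are not part of
PostgreSQL.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # 16777216 bytes, assumed default segment size


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'HI/LO' (both hex) into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline: int, lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the name of the WAL segment file that contains the given LSN."""
    segments_per_log_id = 0x100000000 // segment_size
    segment_number = parse_lsn(lsn) // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_log_id,
        segment_number % segments_per_log_id,
    )


def offset_in_segment(lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> int:
    """Return the byte offset of the LSN inside its segment file."""
    return parse_lsn(lsn) % segment_size


if __name__ == "__main__":
    lsn = "FED/38FFE208"
    print(wal_file_name(1, lsn))                      # segment file holding the LSN
    print(offset_in_segment(lsn))                     # offset within that file
    print(WAL_SEGMENT_SIZE - offset_in_segment(lsn))  # bytes left until the segment end
```

Under these assumptions, FED/38FFE208 lies in segment
0000000100000FED00000038, 7672 bytes before its end, so the position in the
message is in the file just before 0000000100000FED00000039. That would be
consistent with a record that starts near the end of one segment and
continues into the next, but I have not verified it against the actual files.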

If the WAL was incomplete during the out-of-space condition, it might have
been appended to during instance start (though I doubt archive_command would
be invoked on an incomplete WAL); that is why I compared the file on the
primary (after it was back up and running) with the restored one on the
replica instance.
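
The comparison described above can be sketched roughly as follows (Python).
This is only an illustration of the check, not the exact commands that were
run; the function names and any paths passed in are hypothetical.

```python
import hashlib
import os

EXPECTED_SEGMENT_SIZE = 16777216  # size of a complete WAL segment, as stated above


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so a whole segment is never held in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_segments(primary_copy: str, replica_copy: str,
                     expected_size: int = EXPECTED_SEGMENT_SIZE) -> dict:
    """Check that both copies have the expected size and identical content."""
    sizes_ok = (
        os.path.getsize(primary_copy) == expected_size
        and os.path.getsize(replica_copy) == expected_size
    )
    identical = sha256_of(primary_copy) == sha256_of(replica_copy)
    return {"sizes_ok": sizes_ok, "identical": identical}
```

A result with both values true only shows that the archived copy matches what
is on the primary; it says nothing about whether the segment's content is
valid.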

In other words, if this log message were emitted only once and recovery
continued restoring subsequent WALs, I'd be OK with that, but since recovery
is stuck at this WAL, I'm in doubt whether I did something wrong (e.g. an
improper recovery.conf ...) or what a possible workaround would be to let the
replica (if possible) process this WAL and continue with recovery. The
database size is almost 2 TB, which is why I'd like to avoid full restores
to create DEV environments and use ZFS clones instead.

Thanks for any hints on how to let the replica continue applying WAL files.

Kind regards, Ales Zeleny
