From: | Alex Malek <magicagent(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: bad wal on replica / incorrect resource manager data checksum in record / zfs |
Date: | 2020-04-02 17:44:57 |
Message-ID: | CAGH8ccfa3fPoT0TizkrQ3Z4gz5XJi+pSBqN8CHUAHmqWEcf0zA@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Feb 19, 2020 at 4:35 PM Alex Malek <magicagent(at)gmail(dot)com> wrote:
>
> Hello Postgres Hackers -
>
> We are having a recurring issue on 2 of our replicas where replication
> stops due to this message:
> "incorrect resource manager data checksum in record at ..."
> This has been occurring on average once every 1 to 2 weeks during large
> data imports (100s of GBs being written)
> on one of two replicas.
> Fixing the issue has been relatively straightforward: shut down the replica,
> remove the bad WAL file, restart the replica, and
> the good WAL file is retrieved from the master.
> We are doing streaming replication using replication slots.
> However, twice now the master had already removed the WAL file, so the file
> had to be retrieved from the WAL archive.
>
> The WAL log directories on the master and the replicas are on ZFS file
> systems.
> All servers are running RHEL 7.7 (Maipo)
> PostgreSQL 10.11
> ZFS v0.7.13-1
>
> The issue seems similar to
> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
> and to https://github.com/timescale/timescaledb/issues/1443
>
> One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS
> sees our array as a single device.
> ....
> <snip>
>
An update in case someone else encounters the same issue.
About 5 weeks ago, on the master database server, we turned off ZFS
compression for the volume where the WAL log resides.
The error has not occurred on any replica since.
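
For anyone wanting to try the same change, here is a minimal sketch of checking and then disabling compression on a ZFS dataset. The dataset name `tank/pg_wal` and the helper function names are hypothetical, and this is not necessarily the exact procedure used above. Note that changing the `compression` property only affects blocks written after the change; existing data stays as it was written.

```python
import subprocess

# Hypothetical dataset name; substitute the dataset that holds your WAL directory.
DATASET = "tank/pg_wal"


def get_compression_cmd(dataset):
    """Build the command that prints the current compression setting."""
    return ["zfs", "get", "-H", "-o", "value", "compression", dataset]


def disable_compression_cmd(dataset):
    """Build the command that turns compression off for newly written blocks."""
    return ["zfs", "set", "compression=off", dataset]


def disable_compression(dataset, dry_run=True):
    """Print (dry run) or execute the check-then-disable sequence."""
    cmds = [get_compression_cmd(dataset), disable_compression_cmd(dataset)]
    for cmd in cmds:
        if dry_run:
            # Only show what would be run; nothing is changed.
            print(" ".join(cmd))
        else:
            # Requires the zfs CLI and sufficient privileges.
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    disable_compression(DATASET)  # dry run by default
```

The dry-run default is deliberate, so the commands can be reviewed before anything is changed on a production server.
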
Best,
Alex