Re: Promoted slave tries to archive previously archived WAL file

From: Andre Piwoni <apiwoni(at)webmd(dot)net>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Promoted slave tries to archive previously archived WAL file
Date: 2019-02-15 20:03:39
Message-ID: CAEC-+VHAQVCCCiT0pEcg=CZvmXUFeFDGaxGJgFtv5+G7NOqjAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Ok. I think I uncovered a bug.

My slave nodes were created using pg_basebackup with --wal-method=stream.
If I understand right this option streams WAL files generated during backup
and this WAL file was 000000010000000000000002 but its contents were
different from what was on the primary and in WAL archive. When I changed
--wal-method=fetch which collects WAL files generated during the backup at
the end then diff did not detect any changes. When I failover I don't seem
to have the issue with new primary to archive.

It seems streaming WAL segments created during backup produced corrupt file
based on diff.

On Fri, Feb 15, 2019 at 10:23 AM Andre Piwoni <apiwoni(at)webmd(dot)net> wrote:

> I have master and slave running with the following contents of their
> pg_wal directories and archivedir:
>
> ls -l /mnt/pgsql/archive/
> -rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 000000010000000000000001
> -rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 000000010000000000000002
> -rw-rw-rw-. 1 root root 302 Feb 15 09:39
> 000000010000000000000002.00000028.backup
>
> pg-hdp-node1.kitchen.local
> /var/lib/pgsql/10/data/pg_wal/:
> -rw-------. 1 postgres postgres 16777216 Feb 15 09:39
> 000000010000000000000002
> -rw-------. 1 postgres postgres 302 Feb 15 09:39
> 000000010000000000000002.00000028.backup
> -rw-------. 1 postgres postgres 16777216 Feb 15 09:44
> 000000010000000000000003
> -rw-------. 1 postgres postgres 16777216 Feb 15 09:39
> 000000010000000000000004
> drwx------. 2 postgres postgres 96 Feb 15 09:44 archive_status
> /var/lib/pgsql/10/data/pg_wal/archive_status:
> -rw-------. 1 postgres postgres 0 Feb 15 09:39
> 000000010000000000000002.00000028.backup.done
> -rw-------. 1 postgres postgres 0 Feb 15 09:39
> 000000010000000000000002.done
>
> pg-hdp-node2.kitchen.local
> /var/lib/pgsql/10/data/pg_wal/:
> -rw-------. 1 postgres root 16777216 Feb 15 09:39
> 000000010000000000000002
> -rw-------. 1 postgres postgres 16777216 Feb 15 09:44
> 000000010000000000000003
> drwx------. 2 postgres root 6 Feb 15 09:39 archive_status
> /var/lib/pgsql/10/data/pg_wal/archive_status:
>
> diff from secondary pg-hdp-node2.kitchen.local on
> /var/lib/pgsql/10/data/pg_wal/000000010000000000000002 and
> /mnt/pgsql/archive/000000010000000000000002 shows binary differences but as
> expected no differences for diff on primary pg-hdp-node1.kitchen.local
>
> Failover is performed and pg-hdp-node2.kitchen.local tries and fails to
> archive WAL segment 000000010000000000000002 because it has been previously
> archived
> 2019-02-15 09:54:50.518 PST [780] DETAIL: The failed archive command was:
> test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp
> pg_wal/000000010000000000000002 /mnt/pgsql/archive/000000010000000000000002
>
> Based on this thread
> https://www.postgresql.org/message-id/11b405a6-2176-9510-bf5b-ea9c0e860635%40pgmasters.net
> it is suggested to handle this case by reporting success but in my case
> contents are different. I would think that previously archived
> 000000010000000000000002 is the right WAL segment.
>
> So my questions are as follows:
>
> Why WAL segments differ?
> How should this be resolved on the new primary?
> --
>
> *Andre Piwoni*
>

--

*Andre Piwoni*

Sr. Software Developer, BI/Database

*Web*MD Health Services

Mobile: 801.541.4722

www.webmdhealthservices.com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Rich Shepard 2019-02-15 20:38:59 Re: Subquery to select max(date) value
Previous Message Jeremy Finzel 2019-02-15 19:01:50 Re: Size estimation of postgres core files