Re: streaming vs wal shipping issue

From: Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com>
To: Tomáš Uko <uko(at)avast(dot)com>
Cc: pgsql-admin <pgsql-admin(at)postgresql(dot)org>, Petr Novák <novakp(at)avast(dot)com>
Subject: Re: streaming vs wal shipping issue
Date: 2019-02-26 16:55:35
Message-ID: CAGDYbUNNmD9jvTkSXupS9_9xQebVc38MxU-g+QCZfSCKz+TNzg@mail.gmail.com
Lists: pgsql-admin

PostgreSQL doesn't clean up the files copied by your archive_command; you need
a separate task to clean those out. The WAL segments PostgreSQL manages itself
(governed by wal_keep_segments etc.) live in the data/pg_xlog directory.

Please have a look at
http://www.postgresql.org/docs/9.4/interactive/pgarchivecleanup.html
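
For example, a minimal sketch of that cleanup (the archive path here is an
assumption, substitute your own):

# recovery.conf on the standby: prune the archive at every restartpoint
archive_cleanup_command = 'pg_archivecleanup /path/to/wal_archive %r'

# or run it manually / from cron, removing everything older than the named segment
pg_archivecleanup /path/to/wal_archive 000000020000CEA90000008A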

Thanks & Regards,
*Shreeyansh DBA Team*
www.shreeyansh.com

On Tue, Feb 26, 2019 at 8:31 PM Tomáš Uko <uko(at)avast(dot)com> wrote:

> Hi,
>
>
>
> Just to clarify: we use both methods, everything is configured correctly, and
> it works when there is no issue, e.g. see this line from the log:
>
> 2019-02-25 09:09:28.490 CET, pid: 2090891, client: , db: , usr: ; LOG:
> started streaming WAL from primary at CE8B/17000000 on timeline 2
>
>
>
> But the problem appears once the replication lag grows large enough: streaming
> replication gives up and the replica reverts to replaying WALs (which is the
> point of having both methods, so we don't need to re-initialize the
> slave/replica from the master again), until it catches up with the master again
> (and switches back to streaming replication). The only problem is that those
> WAL files copied from the “master” (technically from a shared storage server)
> are never deleted from the slave/replica's pg_xlog directory after they are
> replayed on the slave/replica.
>
>
>
> Best regards
>
> Tomas Uko
>
>
>
> *From:* Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com>
> *Sent:* úterý 26. února 2019 14:01
> *To:* Tomáš Uko <uko(at)avast(dot)com>
> *Cc:* pgsql-admin <pgsql-admin(at)postgresql(dot)org>; Petr Novák <
> novakp(at)avast(dot)com>
> *Subject:* Re: streaming vs wal shipping issue
>
>
>
> Hi Tomáš Uko,
>
>
> Here is a bit more detail to help understand streaming vs WAL shipping:
> "Streaming replication" refers to the continuous sending of WAL records over a
> TCP/IP connection between the master and the replica, using the walsender
> protocol over replication connections. The master reads its own WAL from
> pg_xlog and sends it to the replica on demand. It's configured with a
> primary_conninfo directive in recovery.conf and pg_hba.conf entries on
> the master to permit replication connections. You also need
> wal_keep_segments and some other options.
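>
> A minimal sketch of that side (the host, user and segment count below are
> made-up illustration values, not from your setup):
>
> # recovery.conf on the replica
> standby_mode = 'on'
> primary_conninfo = 'host=master.example.com port=5432 user=replicator password=SECRET'
>
> # pg_hba.conf on the master, permitting the replication connection
> host    replication    replicator    192.0.2.0/24    md5
>
> # postgresql.conf on the master, to keep extra WAL in pg_xlog for the replica
> wal_keep_segments = 256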
>
> "Log shipping" refers to periodic sending of WAL records as whole WAL
> archives via a file transfer protocol to an archive location from which the
> replica can then fetch them. It's configured with a restore_command directive
> in recovery.conf and an archive_command in the master. PostgreSQL doesn't
> care where the files are or how they're transferred, only that the
> archive_command puts them there and the restore_command fetches the
> required archive.
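>
> A minimal sketch of the two ends (the archive host and directory are
> assumptions):
>
> # postgresql.conf on the master: push each finished segment to the archive
> archive_mode = on
> archive_command = 'scp %p archive-host:/wal_archive/%f'
>
> # recovery.conf on the replica: pull segments back on demand
> restore_command = 'scp archive-host:/wal_archive/%f %p'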
>
> Streaming replication doesn't have as much lag, as records are sent as
> they are generated. However, it requires both master and replica to be
> online and able to communicate directly. It also requires the replica to
> keep up well enough that the master still has on-disk copies of the WAL the
> replica needs, and generally requires you to spend extra pg_xlog space on
> retaining extra WAL for the replica.
>
>
>
> Thanks & Regards,
> *Shreeyansh DBA Team*
> www.shreeyansh.com
>
>
>
>
>
> On Tue, Feb 26, 2019 at 2:56 PM Tomáš Uko <uko(at)avast(dot)com> wrote:
>
> Hi all,
>
>
>
> We have recently found a weird issue with WAL files still being present after
> recovery (in fact, they are never deleted). This problem occurs on several
> versions (from 9.4 up to 9.6, possibly later).
>
> Our setup is this:
>
> - Master in one DC, with WAL shipping to a different storage/server
>
> o WAL shipping via archive_command (pigz and scp, and we check the return
> codes :) ); a rough sketch of this archive side is included after the restore
> script below
>
> - Several replicas in different DCs, configured to use streaming
> replication as well as WAL restore from that storage, with archive_mode =
> 'on'
>
> o snippet of recovery.conf:
>
> standby_mode = 'on'
>
> primary_conninfo = 'host=XY port=5432 user=AAA password=PASSWORD
> sslmode=verify-full'
>
> restore_command = '~/bin/wal_handler.sh -r -h REMOTE_STORAGE -f %f -p %p
> -d LOCATION_ON_REMOTE_STORAGE'
>
>
>
> This custom script does this:
>
> # warn if the target file already exists (the script continues anyway)
> [ -f "PARAM_P/PARAM_F" ] && { echo "Logfile PARAM_P/PARAM_F already exists"; }
>
> echo "Restoring xlog PARAM_F"
>
> # fetch the compressed segment from the archive host
> scp -i $SshKey -o PasswordAuthentication=no -o StrictHostKeyChecking=no REMOTE_STORAGE:LOCATION_ON_REMOTE_STORAGE/PARAM_F.gz PARAM_P
>
> [ $? == 0 ] || { echo "Failed to retrieve log PARAM_F from REMOTE_STORAGE" && exit 1; }
>
> # decompress in place so PostgreSQL can replay the segment
> unpigz -f PARAM_P/PARAM_F.gz
>
> [ $? == 0 ] || { echo "Failed to unpack log to PARAM_P/PARAM_F.gz" && exit 1; }
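>
> For completeness, the archive side mentioned above (pigz plus scp, with return
> code checks) looks roughly like this; a simplified sketch with made-up host and
> path names, not the exact production script:
>
> # on the master: archive_command = '~/bin/wal_archive.sh %p %f'
> #!/bin/bash
> WalPath="$1"   # %p - path to the segment inside pg_xlog
> WalName="$2"   # %f - bare segment file name
>
> # compress the segment, then copy it to the shared storage server
> pigz -c "$WalPath" > "/tmp/$WalName.gz" || { echo "Failed to compress $WalName"; exit 1; }
>
> scp -i $SshKey -o PasswordAuthentication=no "/tmp/$WalName.gz" REMOTE_STORAGE:LOCATION_ON_REMOTE_STORAGE/ || { echo "Failed to copy $WalName.gz"; exit 1; }
>
> rm -f "/tmp/$WalName.gz"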
>
>
>
> When our master comes under a massive DML workload (on a daily basis), some of
> the replicas stop using streaming replication and switch to WAL restore (so
> far so good). Message in the logs:
>
>
>
> 2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; LOG:
> recovery restart point at CEA9/3007188
>
> 2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; DETAIL:
> last completed transaction was at log time 2019-02-25 22:27:14.059533+01
>
> 2019-02-25 23:10:36.829 CET, pid: 2100971, client: , db: , usr: ; LOG:
> restartpoint starting: xlog
>
> 2019-02-25 23:10:47.832 CET, pid: 2100974, client: , db: , usr: ; FATAL:
> could not receive data from WAL stream: ERROR: requested WAL segment
> 000000020000CEA90000008A has already been removed
>
>
>
> But once they start restoring WALs, those files are never deleted afterwards.
> We've discovered that when there is a restart point, the server creates the
> next WAL segment in the sequence (in the pg_xlog dir), which is immediately
> overwritten by the restore.
>
> With overwrite enabled, the logs are full of:
>
> Restoring xlog 000000020000CE3900000029
>
> Restoring xlog 000000020000CE390000002A
>
> Restoring xlog 000000020000CE390000002B
>
> Restoring xlog 000000020000CE390000002C
>
> Restoring xlog 000000020000CE390000002D
>
> Restoring xlog 000000020000CE390000002E
>
> And once it catches up:
>
> scp: /LOCATION_ON_REMOTE_STORAGE/000000020000CE8B00000017.gz: No such file
> or directory
>
> Failed to retrieve log 000000020000CE8B00000017 from REMOTE_STORAGE
>
> 2019-02-25 09:09:28.490 CET, pid: 2090891, client: , db: , usr: ; LOG:
> started streaming WAL from primary at CE8B/17000000 on timeline 2
>
>
>
> When we disabled overwriting of WALs during the restore, we got this error
> message:
>
> 2019-02-25 23:10:03.086 CET, pid: 2100971, client: , db: , usr: ; LOG:
> restartpoint complete: wrote 142421 buffers (18.1%); 0 transaction log
> file(s) added, 68 removed, 0 recycled; write=24.086 s, sync=0.022 s,
> total=24.424 s; sync files=74, longest=0.005 s, average=0.000 s;
> distance=1114130 kB, estimate=1114148 kB
>
> 2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; LOG:
> recovery restart point at CEA9/3007188
>
> 2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; DETAIL:
> last completed transaction was at log time 2019-02-25 22:27:14.059533+01
>
> 2019-02-25 23:10:36.829 CET, pid: 2100971, client: , db: , usr: ; LOG:
> restartpoint starting: xlog
>
> Logfile pg_xlog/000000020000CEA90000008A already exists
>
> Restoring xlog 000000020000CEA90000008A
>
> unpigz: abort: write error on pg_xlog/000000020000CEA90000008A
> (Inappropriate ioctl for device)
>
> Which means the WAL file is already present in the pg_xlog directory.
>
>
>
> Thanks for any advice
>
