Re: streaming vs wal shipping issue

From: Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com>
To: Tomáš Uko <uko(at)avast(dot)com>
Cc: pgsql-admin <pgsql-admin(at)postgresql(dot)org>, Petr Novák <novakp(at)avast(dot)com>
Subject: Re: streaming vs wal shipping issue
Date: 2019-02-26 13:00:58
Message-ID: CAGDYbUPW7NEoTueYP3eDa14e5Zk0kHCrAjbG4j4trm7MRs09DA@mail.gmail.com
Lists: pgsql-admin

Hi Tomáš Uko,

Here is a more detailed explanation of streaming replication vs. WAL shipping:
"Streaming replication" refers to continuous sending of WAL records over a
TCP/IP connection between the master and the replica, using the walsender
protocol over replication connections. The master reads its own WAL from
pg_xlog and sends it to the replica on demand. It's configured with a
primary_conninfo directive in recovery.conf and pg_hba.conf entries on the
master to permit replicationconnections. You also need wal_keep_segments and
some other options.
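
For example, a minimal streaming setup might look roughly like this (the host
names, replication user, and the wal_keep_segments value are placeholders,
not recommendations for your environment):

  # master, postgresql.conf
  wal_level = replica                 # 'hot_standby' on 9.4/9.5
  max_wal_senders = 5
  wal_keep_segments = 256

  # master, pg_hba.conf -- allow the replica to open replication connections
  host  replication  replicator  10.0.0.10/32  md5

  # replica, recovery.conf
  standby_mode = 'on'
  primary_conninfo = 'host=master.example.com port=5432 user=replicator password=SECRET'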

"Log shipping" refers to periodic sending of WAL records as whole WAL
archives via a file transfer protocol to an archive location from which the
replica can then fetch them. It's configured with a restore_command directive
in recovery.conf and an archive_command in the master. PostgreSQL doesn't
care where the files are or how they're transferred, only that the
archive_command puts them there and the restore_command fetches the
required archive.
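
A minimal sketch of that setup, assuming a hypothetical archive host and path
(in practice you would add compression and return-code checking, as in the
script quoted below):

  # master, postgresql.conf
  archive_mode = on
  archive_command = 'scp %p archive-host:/wal_archive/%f'

  # replica, recovery.conf
  restore_command = 'scp archive-host:/wal_archive/%f %p'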

Streaming replication doesn't have as much lag, as records are sent as they
are generated. However, it requires both master and replica to be online
and able to communicate directly. It also requires the replica to keep up
well enough that the master still has on-disk copies of the WAL the replica
needs, and generally requires you to spend extra pg_xlog space on retaining
extra WAL for the replica.
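
To put a rough number on that space cost (assuming the default 16 MB segment
size):

  wal_keep_segments = 256   ->   256 * 16 MB = 4 GB of extra WAL held in pg_xlog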

Thanks & Regards,
*Shreeyansh DBA Team*
www.shreeyansh.com

On Tue, Feb 26, 2019 at 2:56 PM Tomáš Uko <uko(at)avast(dot)com> wrote:

> Hi all,
>
>
>
> We have recently found a weird issue with WAL files still being present after
> recovery (in fact, they are never deleted). This problem occurs on several
> versions (starting from 9.4 up to 9.6, possibly later).
>
> Our setup is this:
>
> - Master in one DC, with wal shipping to different
> storage/server
>
> o WAL shipping via archive_command (pigz and scp - and we check return
> codes :) )
>
> - Several replicas in different DCs, configured to use streaming
> replication as well as WAL restore from that storage, with archive_mode =
> 'on'
>
> o snippet of recovery.conf:
>
> standby_mode = 'on'
>
> primary_conninfo = 'host=XY port=5432 user=AAA password=PASSWORD sslmode=verify-full'
>
> restore_command = '~/bin/wal_handler.sh -r -h REMOTE_STORAGE -f %f -p %p -d LOCATION_ON_REMOTE_STORAGE'
>
>
>
> This custom script does this:
>
> [ -f "PARAM_P/PARAM_F" ] && { echo "Logfile PARAM_P/PARAM_F already exists"; }
>
> echo "Restoring xlog PARAM_F"
>
> scp -i $SshKey -o PasswordAuthentication=no -o StrictHostKeyChecking=no REMOTE_STORAGE:LOCATION_ON_REMOTE_STORAGE/PARAM_F.gz PARAM_P
>
> [ $? == 0 ] || { echo "Failed to retrieve log PARAM_F from REMOTE_STORAGE" && exit 1; }
>
> unpigz -f PARAM_P/PARAM_F.gz
>
> [ $? == 0 ] || { echo "Failed to unpack log to PARAM_P/PARAM_F.gz" && exit 1; }
>
>
>
> When our master comes under massive DML workload (on a daily basis), some
> of the replicas stop using streaming replication and switch to WAL restore
> (so far so good). Message in the logs:
>
>
>
> 2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; LOG:
> recovery restart point at CEA9/3007188
>
> 2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; DETAIL:
> last completed transaction was at log time 2019-02-25 22:27:14.059533+01
>
> 2019-02-25 23:10:36.829 CET, pid: 2100971, client: , db: , usr: ; LOG:
> restartpoint starting: xlog
>
> 2019-02-25 23:10:47.832 CET, pid: 2100974, client: , db: , usr: ; FATAL:
> could not receive data from WAL stream: ERROR: requested WAL segment
> 000000020000CEA90000008A has already been removed
>
>
>
> But once they start restoring WALs, the files are never deleted afterwards.
> We've discovered that when a restart point occurs, the server creates the next
> WAL segment in sequence (in the pg_xlog dir), which is immediately overwritten
> by the restore.
>
> With overwrite enabled, the logs are full of:
>
> Restoring xlog 000000020000CE3900000029
>
> Restoring xlog 000000020000CE390000002A
>
> Restoring xlog 000000020000CE390000002B
>
> Restoring xlog 000000020000CE390000002C
>
> Restoring xlog 000000020000CE390000002D
>
> Restoring xlog 000000020000CE390000002E
>
> And once it catches up:
>
> scp: /LOCATION_ON_REMOTE_STORAGE/000000020000CE8B00000017.gz: No such file
> or directory
>
> Failed to retrieve log 000000020000CE8B00000017 from REMOTE_STORAGE
>
> 2019-02-25 09:09:28.490 CET, pid: 2090891, client: , db: , usr: ; LOG:
> started streaming WAL from primary at CE8B/17000000 on timeline 2
>
>
>
> When we disabled overwriting of WALs during the restore, we got this error
> message:
>
> 2019-02-25 23:10:03.086 CET, pid: 2100971, client: , db: , usr: ; LOG:
> restartpoint complete: wrote 142421 buffers (18.1%); 0 transaction log
> file(s) added, 68 removed, 0 recycled; write=24.086 s, sync=0.022 s,
> total=24.424 s; sync files=74, longest=0.005 s, average=0.000 s;
> distance=1114130 kB, estimate=1114148 kB
>
> 2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; LOG:
> recovery restart point at CEA9/3007188
>
> 2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; DETAIL:
> last completed transaction was at log time 2019-02-25 22:27:14.059533+01
>
> 2019-02-25 23:10:36.829 CET, pid: 2100971, client: , db: , usr: ; LOG:
> restartpoint starting: xlog
>
> Logfile pg_xlog/000000020000CEA90000008A already exists
>
> Restoring xlog 000000020000CEA90000008A
>
> unpigz: abort: write error on pg_xlog/000000020000CEA90000008A
> (Inappropriate ioctl for device)
>
> Which means the WAL file is already present in the pg_xlog directory.
>
>
>
> Thanks for any advice
