streaming vs wal shipping issue

From: Tomáš Uko <uko(at)avast(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Cc: Petr Novák <novakp(at)avast(dot)com>
Subject: streaming vs wal shipping issue
Date: 2019-02-26 09:26:37
Message-ID: 238227add62b080920c485b04ed4c341@mail.gmail.com
Lists: pgsql-admin

Hi all,

We have recently found a weird issue with WAL files still being present after
recovery (in fact, they are never deleted). The problem occurs on several
versions (from 9.4 up to 9.6, possibly later).

Our setup is this:

- Master in one DC, with wal shipping to different storage/server

o WAL shipping via archive_command (pigz and scp - and we do check return
codes :) )

- Several replicas in different DCs, configured to use streaming
replication as well as WAL restore from that storage, with archive_mode =
'on'

o snippet of recovery.conf:

standby_mode = 'on'

primary_conninfo = 'host=XY port=5432 user=AAA password=PASSWORD
sslmode=verify-full'

restore_command = '~/bin/wal_handler.sh -r -h REMOTE_STORAGE -f %f -p %p -d
LOCATION_ON_REMOTE_STORAGE'
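For completeness, the archive_command on the master looks roughly like this
(a sketch - the exact command and paths are simplified here; %p and %f are
PostgreSQL's placeholders for the segment path and file name):

```
archive_command = 'pigz -c %p > /tmp/%f.gz && scp /tmp/%f.gz REMOTE_STORAGE:LOCATION_ON_REMOTE_STORAGE/ && rm /tmp/%f.gz'
```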

This custom script does this:

[ -f "$PARAM_P/$PARAM_F" ] && { echo "Logfile $PARAM_P/$PARAM_F already
exists"; }

echo "Restoring xlog $PARAM_F"

scp -i $SshKey -o PasswordAuthentication=no -o StrictHostKeyChecking=no
REMOTE_STORAGE:LOCATION_ON_REMOTE_STORAGE/$PARAM_F.gz $PARAM_P

[ $? -eq 0 ] || { echo "Failed to retrieve log $PARAM_F from
REMOTE_STORAGE"; exit 1; }

unpigz -f $PARAM_P/$PARAM_F.gz

[ $? -eq 0 ] || { echo "Failed to unpack log $PARAM_P/$PARAM_F.gz";
exit 1; }

When our master comes under massive DML workload (on a daily basis), some
of the replicas stop using streaming replication and switch to WAL restore
(so far so good). Message in logs:

2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; LOG:
recovery restart point at CEA9/3007188

2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; DETAIL:
last completed transaction was at log time 2019-02-25 22:27:14.059533+01

2019-02-25 23:10:36.829 CET, pid: 2100971, client: , db: , usr: ; LOG:
restartpoint starting: xlog

2019-02-25 23:10:47.832 CET, pid: 2100974, client: , db: , usr: ; FATAL:
could not receive data from WAL stream: ERROR: requested WAL segment
000000020000CEA90000008A has already been removed

But once they start restoring WALs, the restored segments are never deleted
afterwards. We’ve discovered that when there is a restart point, recovery
creates the next WAL segment in sequence (in the pg_xlog dir), which is then
immediately overwritten by the restore.

With overwrite enabled logs are full of:

Restoring xlog 000000020000CE3900000029

Restoring xlog 000000020000CE390000002A

Restoring xlog 000000020000CE390000002B

Restoring xlog 000000020000CE390000002C

Restoring xlog 000000020000CE390000002D

Restoring xlog 000000020000CE390000002E

And once it catches up:

scp: /LOCATION_ON_REMOTE_STORAGE/000000020000CE8B00000017.gz: No such file
or directory

Failed to retrieve log 000000020000CE8B00000017 from REMOTE_STORAGE

2019-02-25 09:09:28.490 CET, pid: 2090891, client: , db: , usr: ; LOG:
started streaming WAL from primary at CE8B/17000000 on timeline 2

When we disabled overwriting of WALs during restore, we got this error message:

2019-02-25 23:10:03.086 CET, pid: 2100971, client: , db: , usr: ; LOG:
restartpoint complete: wrote 142421 buffers (18.1%); 0 transaction log
file(s) added, 68 removed, 0 recycled; write=24.086 s, sync=0.022 s,
total=24.424 s; sync files=74, longest=0.005 s, average=0.000 s;
distance=1114130 kB, estimate=1114148 kB

2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; LOG:
recovery restart point at CEA9/3007188

2019-02-25 23:10:03.087 CET, pid: 2100971, client: , db: , usr: ; DETAIL:
last completed transaction was at log time 2019-02-25 22:27:14.059533+01

2019-02-25 23:10:36.829 CET, pid: 2100971, client: , db: , usr: ; LOG:
restartpoint starting: xlog

Logfile pg_xlog/000000020000CEA90000008A already exists

Restoring xlog 000000020000CEA90000008A

unpigz: abort: write error on pg_xlog/000000020000CEA90000008A
(Inappropriate ioctl for device)

Which means the WAL file is already present in the pg_xlog directory.
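A side note on the unpack step: one way to avoid clobbering a segment that
recovery has already pre-created is to decompress into a temporary file first
and only then mv it into place (mv is atomic within one filesystem). A
minimal sketch - gunzip stands in for unpigz here, and the .gz is assumed to
have been fetched by scp already:

```shell
#!/bin/sh
# restore_segment GZFILE DEST
# Decompress GZFILE into a temporary file next to DEST, then move it
# into place. If decompression fails, DEST is left untouched, so a
# pre-created segment is never overwritten with partial data.
restore_segment() {
    gz=$1
    dest=$2
    tmp="$dest.tmp.$$"            # same filesystem as DEST, so mv is atomic
    if gunzip -c "$gz" > "$tmp"; then
        mv -f "$tmp" "$dest"
    else
        rm -f "$tmp"
        echo "Failed to unpack $gz"
        return 1
    fi
}
```

In the real wal_handler.sh this would replace the plain unpigz -f call
(with pigz -dc instead of gunzip -c).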

Thanks for any advice
