Re: WAL Archive command.

From: John Britto <john(at)sodexis(dot)com>
To: Scott Mead <scottm(at)openscg(dot)com>
Cc: Jerry Sievers <gsievers19(at)comcast(dot)net>, PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: WAL Archive command.
Date: 2017-09-28 07:32:34
Message-ID: CAExAUcVUWa+=1D3bDuh9nFokbpzKTxmLYSnvQFWWQ68_+vP88g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I am running streaming replication with the archive.

As you can see below that Master pg_xlog is at WAL:
000000010000013300000093. But archive_status shows way behind:
000000010000013300000088.done
What could be the reason behind this? How should I let the PostgreSQL
archive the WAL from 000000010000013300000089 to 000000010000013300000092.

pg_xlog/
----------------------------------
-rw------- 1 postgres postgres 16777216 Sep 27 23:30
000000010000013300000082
-rw------- 1 postgres postgres 16777216 Sep 27 23:31
000000010000013300000083
-rw------- 1 postgres postgres 16777216 Sep 27 23:31
000000010000013300000084
-rw------- 1 postgres postgres 16777216 Sep 27 23:31
000000010000013300000085
-rw------- 1 postgres postgres 16777216 Sep 27 23:51
000000010000013300000086
-rw------- 1 postgres postgres 16777216 Sep 28 00:47
000000010000013300000087
-rw------- 1 postgres postgres 16777216 Sep 28 01:55
000000010000013300000088
-rw------- 1 postgres postgres 16777216 Sep 28 02:47
000000010000013300000089
-rw------- 1 postgres postgres 16777216 Sep 27 18:04
00000001000001330000008A
-rw------- 1 postgres postgres 16777216 Sep 27 18:02
00000001000001330000008B
-rw------- 1 postgres postgres 16777216 Sep 27 18:02
00000001000001330000008C
-rw------- 1 postgres postgres 16777216 Sep 27 18:05
00000001000001330000008D
-rw------- 1 postgres postgres 16777216 Sep 27 18:03
00000001000001330000008E
-rw------- 1 postgres postgres 16777216 Sep 27 18:02
00000001000001330000008F
-rw------- 1 postgres postgres 16777216 Sep 27 18:02
000000010000013300000090
-rw------- 1 postgres postgres 16777216 Sep 27 18:37
000000010000013300000091
-rw------- 1 postgres postgres 16777216 Sep 27 20:21
000000010000013300000092
-rw------- 1 postgres postgres 16777216 Sep 27 21:00
000000010000013300000093

pg_xlog/archive_status
----------------------------------
-rw------- 1 postgres postgres 0 Sep 27 23:30 000000010000013300000081.done
-rw------- 1 postgres postgres 0 Sep 27 23:30 000000010000013300000082.done
-rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000083.done
-rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000084.done
-rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000085.done
-rw------- 1 postgres postgres 0 Sep 27 23:51 000000010000013300000086.done
-rw------- 1 postgres postgres 0 Sep 28 00:47 000000010000013300000087.done
-rw------- 1 postgres postgres 0 Sep 28 01:55 000000010000013300000088.done

postgresql.conf
#----------------------------------
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 32
archive_mode = on
archive_command = '/opt/wal_archive.sh "%p" "%f" > archive_command.log 2>&1'

wal_archive.sh
------------------------------------
#!/bin/bash -xv
PG_XLOG="$1"
PG_XLOG_FILENAME="$2"
HOST=hostname
ARCHIVE_DIR="/pg_archive/master"
ARCHIVE_TO_KEEP="3" #No of days of archive logs to keep
EMAIL="a(at)b(dot)com"
ERROR_COUNT_FILE="/tmp/replication_archive_error_count.txt"

OLD_COUNT=`cat ${ERROR_COUNT_FILE}`
DNOW=`date +%u`
hour=$(date +%H)
D=`date`

#Do the cleanup if the day is Monday or Thursday and time is between 11
p.m. UTC and 22 hrs UTC
if [ "$DNOW" -eq "1" -o "$DNOW" -eq "4" -a "$hour" -ge 11 -a "$hour" -lt 22
]; then
find "${ARCHIVE_DIR}"/ -type f -mtime +"${ARCHIVE_TO_KEEP}" -exec rm -f
{} +
if [ "$?" -eq "1" ]; then
echo "The wal_archive script could not cleanup the archive
directory of $HOST" | mail -s "ERROR - WAL Archive for $HOST" "$EMAIL"
fi
fi

if [ ! -f "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}" ]; then
cp "${PG_XLOG}" "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}"
/usr/bin/rsync -W -az "${PG_XLOG}" postgres(at)standby
:"${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}"
if [ "$?" -ne "0" ]; then
#If rsync fails, then remove the copied file from master, increase
the error count, and retry.
rm -rf "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}"
NEW_COUNT=`expr $OLD_COUNT + 1`
if [ "$NEW_COUNT" -ge "100" ]; then
echo -e "${D}""\n""Rsync could not transfer the WAL file from
Master to slave." | mail -s "ALERT - WAL Archive for $HOST" "$EMAIL"
echo "0" > $ERROR_COUNT_FILE
else
echo "$NEW_COUNT" > $ERROR_COUNT_FILE
fi
exit 1
else
echo "0" > $ERROR_COUNT_FILE
exit 0
fi
else
exit 0
fi

John Britto, M.Sc.
DevOps Engineer
Sodexis, Inc.
www.sodexis.com
M: +91-0-8012186991
<john(at)sodexis(dot)com>

On Thu, Sep 28, 2017 at 1:05 AM, Scott Mead <scottm(at)openscg(dot)com> wrote:

>
>
> On Wed, Sep 27, 2017 at 2:55 PM, Jerry Sievers <gsievers19(at)comcast(dot)net>
> wrote:
>
>> John Britto <john(at)sodexis(dot)com> writes:
>>
>> > Hello,
>> >
>> > I have a streaming replication setup along with WAL archive.
>> >
>> > archive_command = ‘test ! -f /var/pg_archive/%f && cp %p <archive
>> > location>%f && scp %p postgres(at)192(dot)168(dot)0(dot)123:<archive location>/%f'
>> >
>> > When the SCP command fails, the master repeatedly tries to send the
>> > archived WAL to standby. But during this time, the pg_xlog directly
>> > grew with newer WAL files.
>> >
>> > The streaming replication hasn't had the problem because on my check,
>> > the WAL write location on the primary was same with the last WAL
>> > location received/replayed in standby.
>> >
>> > Since the pg_xlog in the master had few newer WAL files, the master
>> > archive is lagging to pick the current pg_xlog WAL file. When a new
>> > WAL occur in the pg_xlog, Master picks the old WAL file to send to
>> > the standby.
>>
>> Yup Pg is going to handle the unshipped WALs one at a time and it will
>> do them in order, oldest (lowest file name) first.
>>
>> > How should I force the PostgreSQL to batch copy the lagging WAL files
>> > to pg_archive and then send to standby. Can I do this manually using
>> > rsync? I wonder how PostgreSQL knows the changes because it
>> > maintains info in archive_status with extension as .ready and .done.
>>
>> I suggest you fix your basic archiving routine to complete and exit with
>> success to postgres.
>>
>
> +1
>
> scp %p host:/archive/%f
> if [ "$?" -ne 0 ]
> then
> echo "`date`:FAILED:%p" >> /var/log/failed_xlog.log
> cp %p /localarchive/%f
> exit 0
> fi
>
> Essentially, always make sure that you are returning a 0 to postgres.
> If there is a failure, either log it or handle it separately. This code
> snippet is *NOT COMPLETE, *there's a lot more to do in order to make it
> production ready and recoverable. The biggest issue I've had with scp is
> that you have to set and enforce a timeout and trap the timeout. Note, the
> above only works until your local machine (or the /localarchive partition)
> runs out of space. It's *really* important that you have ultra solid
> logging and alerting around this.
>
>
>
>
>> And as for archive command scripts in general, simpler is better.
>>
>> If you want to manually ship them in bulk, you may do so but then will
>> need to remove the corresponding archive_status/$foo.ready file so that
>> postgres won't keep trying to ship the same one.
>>
>
> I'm a huge fan of this strategy, especially if you're sending to a remote
> datacenter.
>
> archive_command.sh:
>
> cp %p /localarchive/%f
> if [ "$?" -ne 0 ]
> then
> echo "`date`:FAILED:%p" >> /var/log/failed_xlog.log
> exit 0
> fi
>
> send_archive_to_remote.sh
> rsync -avzP /localarchive/* host:/archive/
>
>
> Of course, now you have to deal with files that are removed from the slave
> and making sure they get removed from the master appropriately, but, this
> is fairly straightforward.
>
> --Scott
>
>
>
>
>> HTH
>>
>> > Please assist.
>> >
>> > Thanks,
>> >
>> > John Britto
>> >
>> >
>> >
>>
>> --
>> Jerry Sievers
>> Postgres DBA/Development Consulting
>> e: postgres(dot)consulting(at)comcast(dot)net
>> p: 312.241.7800
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
>
>
> --
> --
> Scott Mead
> Sr. Architect
> *OpenSCG <http://openscg.com>*
> http://openscg.com
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message francis cherat 2017-09-28 07:43:57 many many open files on pgsql_tmp with size 0
Previous Message Andres Freund 2017-09-28 02:08:08 Re: Logical Replication - test_decoding - unchanged-toast-datum