Re: WAL Archive Cleanup?

From: Foo Bar <qubitrenegade(at)gmail(dot)com>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: WAL Archive Cleanup?
Date: 2019-03-25 16:21:08
Message-ID: CACLHErTpWzqNkK1vWMU2KvdizRopm23S-CYvOBdKxzX0aMWhTA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello All,

Wow! Lots of awesome replies, Went away for the weekend thinking my email
had been rejected and come back to a full inbox. Thanks for all the help!

>> Postgres version?

9.6.11

Doh. Fairly important detail there. :)

>> FYI, psql is the Postgres client program, Postgres(ql) is the server.
>> "psql" is the name of a specific command line tool used to connect to a
PostgreSQL database server, it is not the name of the database itself. The
database is usually abbreviated "pgsql".

Duly noted, thanks for the correction.

>> It's the standby that has not seen any traffic?

There's really no traffic. I built three nodes, connected them, created a
test table and inserted some values, then left the cluster be for a couple
weeks.

>> And "restartpoint" is usually spelled as one work in technical
discussions of it. Or at least, searching for it that way avoids finding
things which mention each word separately in different senses.

Ah ha. I had seen it that way but thought it was a typo. Thanks for the
clarification!

>> Are you sure it is the archive
directory (/hab/svc/postgresql/data/archive) which is filling up, and not
the live directory (pg_wal or pg_xlog)? This is often a point of confusion.

Right before I sent the mail last week I deleted everything in
/hab/svc/postgresql/data/archive, this morning I'm seeing:

# du -h --max=1 /hab/svc/postgresql/data/
198M /hab/svc/postgresql/data/pgdata
8.9G /hab/svc/postgresql/data/archive
9.1G /hab/svc/postgresql/data/
# du -hs /hab/svc/postgresql/data/pgdata/pg_xlog/
177M /hab/svc/postgresql/data/pgdata/pg_xlog/
# ls -lah /hab/svc/postgresql/data/archive/ | wc -l
571

There is no pg_wal directory though (should there be?)

# find /hab/svc/postgresql/ -name '*pg*wal*'
#

>> If the only reason you want an archive is for replication, then use
streaming replication and do away with the archive completely

To be honest, I thought it was required for streaming replication based on
the guides linked above.

>> There are reasons other than replication that one might want to keep a
WAL archive, but those reasons don't seem to apply to you

Like backup maybe? A wholly other topic, we recently had a power outage
and I lost a pgsql node... having an archive would allow me to "replay" any
transactions?

>> What needs to be determined here is why the standby never consumed the
WAL's from the master?
Ok, so it the standby that's the problem.

>> Do you still have the logs from the standby and do they show anything
relevant?

Sure, what am I looking for? I see a bunch of entries like:

2019-03-08 17:06:11 GMT [1813]: [815-1] user=,db=,client= (0:00000)LOG:
restartpoint complete: wrote 22 buffers (0.0%); 0 transaction log file(s)
added, 0 removed, 1 recycled; write=2.211 s, sync=0.062 s, total=2.281 s;
sync files=18, longest=0.062 s, average=0.003 s; distance=16383 kB,
estimate=16383 kB
2019-03-08 17:06:11 GMT [1813]: [816-1] user=,db=,client= (0:00000)LOG:
recovery restart point at 0/8D000028

On the 15th, around when I think I filled the disk, I see a bunch of:

cp: cannot stat '/00000002.history': No such file or directory
cp: cannot stat '/000000010000000400000049': No such file or directory
2019-03-15 23:59:49 GMT [16691]: [1-1] user=,db=,client= (0:XX000)FATAL:
could not connect to the primary server: could not connect to server:
Connection refused
Is the server running on host "172.16.10.23" and accepting
TCP/IP connections on port 5432?

Which makes sense since the pgsql service was down.

This appears to be when I recovered the master on Thursday:

cp: cannot stat '/00000002.history': No such file or directory
cp: cannot stat '/00000001000000040000004D': No such file or directory
2019-03-21 17:37:31 GMT [31338]: [1-1] user=,db=,client= (0:XX000)FATAL:
could not connect to the primary server: could not connect to server:
Connection refused
Is the server running on host "172.16.10.23" and accepting
TCP/IP connections on port 5432?

cp: cannot stat '/00000002.history': No such file or directory
cp: cannot stat '/00000001000000040000004D': No such file or directory
2019-03-21 17:37:36 GMT [31343]: [1-1] user=,db=,client= (0:00000)LOG:
started streaming WAL from primary at 4/4D000000 on timeline 1
2019-03-21 17:37:47 GMT [30711]: [3151-1] user=,db=,client= (0:00000)LOG:
restartpoint starting: time
2019-03-21 17:37:47 GMT [30711]: [3152-1] user=,db=,client= (0:00000)LOG:
restartpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
added, 0 removed, 1 recycled; write=0.003 s, sync=0.000 s, total=0.007 s;
sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB,
estimate=16384 kB

Then there's a bunch of the same entries where about the only thing (other
than the timestamp) that seems to change is the index in:

2019-03-23 23:33:40 GMT [30711]: [5094-1] user=,db=,client= (0:00000)LOG:
recovery restart point at 5/90000140
2019-03-23 23:38:40 GMT [30711]: [5097-1] user=,db=,client= (0:00000)LOG:
recovery restart point at 5/91000028
2019-03-23 23:43:40 GMT [30711]: [5100-1] user=,db=,client= (0:00000)LOG:
recovery restart point at 5/91000108

I see effectively the same logs on the other hotstandby pgsql node.

This is my config on my standby node:

listen_addresses = '0.0.0.0'
port = 5432
max_connections = 100
external_pid_file = '/hab/svc/postgresql/var/postgresql.pid'
authentication_timeout = 1min
max_files_per_process = 1000
max_locks_per_transaction = 64
logging_collector = on
log_directory = '/hab/svc/postgresql/var/pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,client=%h %r (%x:%e)'
log_min_messages = ERROR
datestyle = 'iso, mdy'
default_text_search_config = 'pg_catalog.english'
data_directory = '/hab/svc/postgresql/data/pgdata'
hba_file = '/hab/svc/postgresql/config/pg_hba.conf'
wal_level = hot_standby
wal_log_hints = 'on'
hot_standby = 'on'
hot_standby_feedback = true
max_wal_senders = 5
max_replication_slots = 5
checkpoint_completion_target = 0.9
max_wal_size = 1GB
min_wal_size = 128MB
wal_keep_segments = 8
log_checkpoints = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
track_activity_query_size = 2048
track_io_timing=on
dynamic_shared_memory_type = 'none'
archive_mode = 'on'
archive_command = 'cp %p /hab/svc/postgresql/data/archive/%f'
archive_timeout = '10min'
max_standby_archive_delay = '30s'
synchronous_commit = local
include '/hab/svc/postgresql/config/postgresql.local.conf'

Should I disable archive mode? Even though I'm not currently using it, it
seems like there's a use-case for having it? And if I can configure out
what causing the backup of archive files it should still be manageable?

Thanks again for all the replies, while it hasn't solved the problem yet,
this was incredibly helpful! Also, please don't hate me for munging all
your replies into one reply... I thought it might be easier to follow than
having three different branches...?

Best Regards,
- QBR

On Fri, Mar 22, 2019 at 7:14 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Fri, Mar 22, 2019 at 12:26:33PM -0400, Jeff Janes wrote:
> > archive_cleanup_command is pretty much obsolete. The modern way to do
> this
> > is with streaming replication, using either replication slots or
> > wal_keep_segments. If the only reason you want an archive is for
> > replication, then use streaming replication and do away with the archive
> > completely. There are reasons other than replication that one might want
> > to keep a WAL archive, but those reasons don't seem to apply to you. And
> > if they did you almost certainly wouldn't want to run
> > archive_cleanup_command on it.
>
> Personally, I still find archives also very valuable when a standby
> creation takes a long time because of a large initial base backup and
> that the partition dedicated to pg_wal is not large enough to support
> the retention associated with a slot, and it is easier to have larger
> retention policies in the archives.
> --
> Michael
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Thomas Tignor 2019-03-25 18:46:17 postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"
Previous Message Frank 2019-03-25 15:44:21 Re: When to store data that could be derived