Re: WAL Archive Cleanup?

From: Foo Bar <qubitrenegade(at)gmail(dot)com>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: WAL Archive Cleanup?
Date: 2019-04-01 16:24:02
Message-ID: CACLHErQOETLJowV05y07vKjSrq7YKE5KDQxotSTNzy0AKhsB6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello All,

Ok, so maybe something helpful? On my master node I am seeing a bunch of:

2019-03-28 23:54:44 GMT [2611]: [1-1]
user=admin,db=admin,client=17210.10.37 172.10.10.37(57552) (0:3D000)FATAL:
database "admin" does not exist
2019-03-28 23:54:46 GMT [2613]: [1-1]
user=admin,db=admin,client=172.10.10.18 172.10.10.18(56970)
(0:3D000)FATAL: database "admin" does not exist

Which is accurate, as there is no admin database... I usually connect with

psql -h localhost -U admin postgres

Should there be? Will this fix my issue with pgsql filling up the disk?

Thanks,
- QBR

On Mon, Mar 25, 2019 at 10:21 AM Foo Bar <qubitrenegade(at)gmail(dot)com> wrote:

> Hello All,
>
> Wow! Lots of awesome replies, Went away for the weekend thinking my email
> had been rejected and come back to a full inbox. Thanks for all the help!
>
> >> Postgres version?
>
> 9.6.11
>
> Doh. Fairly important detail there. :)
>
> >> FYI, psql is the Postgres client program, Postgres(ql) is the server.
> >> "psql" is the name of a specific command line tool used to connect to
> a PostgreSQL database server, it is not the name of the database itself.
> The database is usually abbreviated "pgsql".
>
> Duly noted, thanks for the correction.
>
> >> It's the standby that has not seen any traffic?
>
> There's really no traffic. I built three nodes, connected them, created a
> test table and inserted some values, then left the cluster be for a couple
> weeks.
>
> >> And "restartpoint" is usually spelled as one work in technical
> discussions of it. Or at least, searching for it that way avoids finding
> things which mention each word separately in different senses.
>
> Ah ha. I had seen it that way but thought it was a typo. Thanks for the
> clarification!
>
> >> Are you sure it is the archive
> directory (/hab/svc/postgresql/data/archive) which is filling up, and not
> the live directory (pg_wal or pg_xlog)? This is often a point of confusion.
>
> Right before I sent the mail last week I deleted everything in
> /hab/svc/postgresql/data/archive, this morning I'm seeing:
>
> # du -h --max=1 /hab/svc/postgresql/data/
> 198M /hab/svc/postgresql/data/pgdata
> 8.9G /hab/svc/postgresql/data/archive
> 9.1G /hab/svc/postgresql/data/
> # du -hs /hab/svc/postgresql/data/pgdata/pg_xlog/
> 177M /hab/svc/postgresql/data/pgdata/pg_xlog/
> # ls -lah /hab/svc/postgresql/data/archive/ | wc -l
> 571
>
> There is no pg_wal directory though (should there be?)
>
> # find /hab/svc/postgresql/ -name '*pg*wal*'
> #
>
> >> If the only reason you want an archive is for replication, then use
> streaming replication and do away with the archive completely
>
> To be honest, I thought it was required for streaming replication based on
> the guides linked above.
>
> >> There are reasons other than replication that one might want to keep a
> WAL archive, but those reasons don't seem to apply to you
>
> Like backup maybe? A wholly other topic, we recently had a power outage
> and I lost a pgsql node... having an archive would allow me to "replay" any
> transactions?
>
> >> What needs to be determined here is why the standby never consumed the
> WAL's from the master?
> Ok, so it the standby that's the problem.
>
> >> Do you still have the logs from the standby and do they show anything
> relevant?
>
> Sure, what am I looking for? I see a bunch of entries like:
>
> 2019-03-08 17:06:11 GMT [1813]: [815-1] user=,db=,client= (0:00000)LOG:
> restartpoint complete: wrote 22 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 1 recycled; write=2.211 s, sync=0.062 s, total=2.281 s;
> sync files=18, longest=0.062 s, average=0.003 s; distance=16383 kB,
> estimate=16383 kB
> 2019-03-08 17:06:11 GMT [1813]: [816-1] user=,db=,client= (0:00000)LOG:
> recovery restart point at 0/8D000028
>
> On the 15th, around when I think I filled the disk, I see a bunch of:
>
> cp: cannot stat '/00000002.history': No such file or directory
> cp: cannot stat '/000000010000000400000049': No such file or directory
> 2019-03-15 23:59:49 GMT [16691]: [1-1] user=,db=,client= (0:XX000)FATAL:
> could not connect to the primary server: could not connect to server:
> Connection refused
> Is the server running on host "172.16.10.23" and accepting
> TCP/IP connections on port 5432?
>
> Which makes sense since the pgsql service was down.
>
> This appears to be when I recovered the master on Thursday:
>
> cp: cannot stat '/00000002.history': No such file or directory
> cp: cannot stat '/00000001000000040000004D': No such file or directory
> 2019-03-21 17:37:31 GMT [31338]: [1-1] user=,db=,client= (0:XX000)FATAL:
> could not connect to the primary server: could not connect to server:
> Connection refused
> Is the server running on host "172.16.10.23" and accepting
> TCP/IP connections on port 5432?
>
> cp: cannot stat '/00000002.history': No such file or directory
> cp: cannot stat '/00000001000000040000004D': No such file or directory
> 2019-03-21 17:37:36 GMT [31343]: [1-1] user=,db=,client= (0:00000)LOG:
> started streaming WAL from primary at 4/4D000000 on timeline 1
> 2019-03-21 17:37:47 GMT [30711]: [3151-1] user=,db=,client=
> (0:00000)LOG: restartpoint starting: time
> 2019-03-21 17:37:47 GMT [30711]: [3152-1] user=,db=,client=
> (0:00000)LOG: restartpoint complete: wrote 0 buffers (0.0%); 0 transaction
> log file(s) added, 0 removed, 1 recycled; write=0.003 s, sync=0.000 s,
> total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s;
> distance=16384 kB, estimate=16384 kB
>
> Then there's a bunch of the same entries where about the only thing (other
> than the timestamp) that seems to change is the index in:
>
> 2019-03-23 23:33:40 GMT [30711]: [5094-1] user=,db=,client=
> (0:00000)LOG: recovery restart point at 5/90000140
> 2019-03-23 23:38:40 GMT [30711]: [5097-1] user=,db=,client=
> (0:00000)LOG: recovery restart point at 5/91000028
> 2019-03-23 23:43:40 GMT [30711]: [5100-1] user=,db=,client=
> (0:00000)LOG: recovery restart point at 5/91000108
>
> I see effectively the same logs on the other hotstandby pgsql node.
>
> This is my config on my standby node:
>
> listen_addresses = '0.0.0.0'
> port = 5432
> max_connections = 100
> external_pid_file = '/hab/svc/postgresql/var/postgresql.pid'
> authentication_timeout = 1min
> max_files_per_process = 1000
> max_locks_per_transaction = 64
> logging_collector = on
> log_directory = '/hab/svc/postgresql/var/pg_log'
> log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
> log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,client=%h %r (%x:%e)'
> log_min_messages = ERROR
> datestyle = 'iso, mdy'
> default_text_search_config = 'pg_catalog.english'
> data_directory = '/hab/svc/postgresql/data/pgdata'
> hba_file = '/hab/svc/postgresql/config/pg_hba.conf'
> wal_level = hot_standby
> wal_log_hints = 'on'
> hot_standby = 'on'
> hot_standby_feedback = true
> max_wal_senders = 5
> max_replication_slots = 5
> checkpoint_completion_target = 0.9
> max_wal_size = 1GB
> min_wal_size = 128MB
> wal_keep_segments = 8
> log_checkpoints = on
> log_lock_waits = on
> log_temp_files = 0
> log_autovacuum_min_duration = 0
> track_activity_query_size = 2048
> track_io_timing=on
> dynamic_shared_memory_type = 'none'
> archive_mode = 'on'
> archive_command = 'cp %p /hab/svc/postgresql/data/archive/%f'
> archive_timeout = '10min'
> max_standby_archive_delay = '30s'
> synchronous_commit = local
> include '/hab/svc/postgresql/config/postgresql.local.conf'
>
> Should I disable archive mode? Even though I'm not currently using it, it
> seems like there's a use-case for having it? And if I can configure out
> what causing the backup of archive files it should still be manageable?
>
> Thanks again for all the replies, while it hasn't solved the problem yet,
> this was incredibly helpful! Also, please don't hate me for munging all
> your replies into one reply... I thought it might be easier to follow than
> having three different branches...?
>
> Best Regards,
> - QBR
>
> On Fri, Mar 22, 2019 at 7:14 PM Michael Paquier <michael(at)paquier(dot)xyz>
> wrote:
>
>> On Fri, Mar 22, 2019 at 12:26:33PM -0400, Jeff Janes wrote:
>> > archive_cleanup_command is pretty much obsolete. The modern way to do
>> this
>> > is with streaming replication, using either replication slots or
>> > wal_keep_segments. If the only reason you want an archive is for
>> > replication, then use streaming replication and do away with the archive
>> > completely. There are reasons other than replication that one might
>> want
>> > to keep a WAL archive, but those reasons don't seem to apply to you.
>> And
>> > if they did you almost certainly wouldn't want to run
>> > archive_cleanup_command on it.
>>
>> Personally, I still find archives also very valuable when a standby
>> creation takes a long time because of a large initial base backup and
>> that the partition dedicated to pg_wal is not large enough to support
>> the retention associated with a slot, and it is easier to have larger
>> retention policies in the archives.
>> --
>> Michael
>>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Michel Pelletier 2019-04-01 16:30:18 Re: Gigantic load average spikes
Previous Message Sathish Kumar 2019-04-01 16:19:49 Re: Table Export & Import