Re: WAL Archive Cleanup?

From: Foo Bar <qubitrenegade(at)gmail(dot)com>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: WAL Archive Cleanup?
Date: 2019-04-01 20:03:57
Message-ID: CACLHErS-1yyOKDrkW9iB2n2uxsTALQWS1vocO87NXkTE85VVNQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello All,

Ok, I think maybe I've been approaching this all wrong. According to THIS
SO <https://dba.stackexchange.com/a/186136> post: "PostgreSQL doesn't
manage the archive directory for you, that is up to you. It doesn't even
know where (or what) the archive location is, that is why it asks you for
an archive_command, rather than just a directory."

Which I think is maybe a key component I was overlooking here. If pgsql
doesn't know anything about the `archive/` directory, then it should be
treated and managed "out of band" from the server?

So I'm thinking something like this:

Mount NFS share to `/pgbackup/`
Master server has once-per-week `pg_basebackup` command that creates a
`/pgbkup/base_backup/<date>` to some NFS share.
Master server has `archive_command` that creates archive in
`/pgbackup/archive`

On another server not part of the above cluster, run `pg_archivecleanup
/pgbackup/archive $(find /pgbackup/base_backup/<date> -name '*.backup')`
(i.e. "find the .backup file from `base_backup/<date>` and remove any
archive files?)

Then I effectively have a place to do point-in-time restores from?

Thanks,
- QBR

On Mon, Apr 1, 2019 at 1:44 PM Foo Bar <qubitrenegade(at)gmail(dot)com> wrote:

> Hello All,
>
> Ok, so creating the admin database has not enabled my cluster to cleanup
> the `archive/` directory... It did eliminate the errors, though I'm
> dubious as to if that was the correct solution...
>
> Everything I'm able to google talks about setting up archive replication,
> but not how to keep that directory in check... I did fine one link
> <https://support.qasymphony.com/hc/en-us/articles/360006834212-Continuous-WAL-Archiving-for-Windows>
> that that talks about how to cleanup obsolete archive files... but I
> thought I didn't want to use `pg_archivecleanup`?
>
> Also, I don't really undersatnd how `pg_basebackup` is supposed to work,
> as you can only ever run it once per directory?
>
> E.g.: if i wanted to run `pg_basebackup` weekly, I'd have to create a new
> directory every time?
>
> # pg_basebackup --verbose --host 127.0.0.1 --username replication --xlog
> --progress --pgdata /hab/svc/postgresql/data/archive/
> pg_basebackup: directory "/hab/svc/postgresql/data/archive/" exists but is
> not empty
> # pg_archivecleanup -d /hab/svc/postgresql/data/archive/
> 000000010000000000000003.000000D0.backup
> pg_archivecleanup: keep WAL file
> "/hab/svc/postgresql/data/archive//000000010000000000000003" and later
> # ls /hab/svc/postgresql/data/archive
> 000000010000000800000081 000000010000000800000083
> 000000010000000800000085 000000010000000800000087
> 000000010000000800000089 00000001000000080000008B
> 00000001000000080000008D 00000001000000080000008F
> 000000010000000800000091 000000010000000800000093
> 000000010000000800000082 000000010000000800000084
> 000000010000000800000086 000000010000000800000088
> 00000001000000080000008A 00000001000000080000008C
> 00000001000000080000008E 000000010000000800000090 000000010000000800000092
> # rm -rf /hab/svc/postgresql/data/archive/*
> # pg_basebackup --verbose --host 127.0.0.1 --port 5432 --username
> replication --xlog --progress --pgdata /hab/svc/postgresql/data/archive/
> Password:
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> transaction log start point: 8/96000028 on timeline 1
> 45946/45946 kB (100%), 1/1 tablespace
> transaction log end point: 8/960000F8
> pg_basebackup: base backup completed
> # pg_basebackup --verbose --host 127.0.0.1 --port 5432 --username
> replication --xlog --progress --pgdata /hab/svc/postgresql/data/archive/
> pg_basebackup: directory "/hab/svc/postgresql/data/archive/" exists but is
> not empty
> # pg_archivecleanup -d /hab/svc/postgresql/data/archive/
> 000000010000000800000096.00000028.backup
> pg_archivecleanup: keep WAL file
> "/hab/svc/postgresql/data/archive//000000010000000800000096" and later
> pg_archivecleanup: removing file
> "/hab/svc/postgresql/data/archive//000000010000000800000095"
> # pg_basebackup --verbose --host 127.0.0.1 --port 5432 --username
> replication --xlog --progress --pgdata /hab/svc/postgresql/data/archive/
> pg_basebackup: directory "/hab/svc/postgresql/data/archive/" exists but is
> not empty
>
>
> I feel like I'm missing something really fundamental here. Does everyone
> just have infinite storage for their `archive/` directory (that doesn't
> _seem_ plausible)?
>
> Should this maybe be stored on an off "box" location?
>
> I think I've figured out `wal_keep_segments` is for the streaming
> replication, i.e.: so I can have a hot-standby, and the `archive_command`
> is for point-in-time backups, e.g.: "oops we modified 10k rows and need to
> roll back the DB to this morning/an hour ago/20mins ago"... (or, another
> usecase I'd really like to be able to support is to give developers a
> replica of the production database, which this seems like it might not be
> the ideal way to go for that?)
>
> Thanks!
> - QBR
>
> On Mon, Apr 1, 2019 at 10:24 AM Foo Bar <qubitrenegade(at)gmail(dot)com> wrote:
>
>> Hello All,
>>
>> Ok, so maybe something helpful? On my master node I am seeing a bunch of:
>>
>> 2019-03-28 23:54:44 GMT [2611]: [1-1]
>> user=admin,db=admin,client=17210.10.37 172.10.10.37(57552) (0:3D000)FATAL:
>> database "admin" does not exist
>> 2019-03-28 23:54:46 GMT [2613]: [1-1]
>> user=admin,db=admin,client=172.10.10.18 172.10.10.18(56970)
>> (0:3D000)FATAL: database "admin" does not exist
>>
>> Which is accurate, as there is no admin database... I usually connect
>> with
>>
>> psql -h localhost -U admin postgres
>>
>> Should there be? Will this fix my issue with pgsql filling up the disk?
>>
>> Thanks,
>> - QBR
>>
>> On Mon, Mar 25, 2019 at 10:21 AM Foo Bar <qubitrenegade(at)gmail(dot)com> wrote:
>>
>>> Hello All,
>>>
>>> Wow! Lots of awesome replies, Went away for the weekend thinking my
>>> email had been rejected and come back to a full inbox. Thanks for all the
>>> help!
>>>
>>> >> Postgres version?
>>>
>>> 9.6.11
>>>
>>> Doh. Fairly important detail there. :)
>>>
>>> >> FYI, psql is the Postgres client program, Postgres(ql) is the server.
>>> >> "psql" is the name of a specific command line tool used to connect
>>> to a PostgreSQL database server, it is not the name of the database itself.
>>> The database is usually abbreviated "pgsql".
>>>
>>> Duly noted, thanks for the correction.
>>>
>>> >> It's the standby that has not seen any traffic?
>>>
>>> There's really no traffic. I built three nodes, connected them, created
>>> a test table and inserted some values, then left the cluster be for a
>>> couple weeks.
>>>
>>> >> And "restartpoint" is usually spelled as one work in technical
>>> discussions of it. Or at least, searching for it that way avoids finding
>>> things which mention each word separately in different senses.
>>>
>>> Ah ha. I had seen it that way but thought it was a typo. Thanks for
>>> the clarification!
>>>
>>> >> Are you sure it is the archive
>>> directory (/hab/svc/postgresql/data/archive) which is filling up, and not
>>> the live directory (pg_wal or pg_xlog)? This is often a point of confusion.
>>>
>>> Right before I sent the mail last week I deleted everything in
>>> /hab/svc/postgresql/data/archive, this morning I'm seeing:
>>>
>>> # du -h --max=1 /hab/svc/postgresql/data/
>>> 198M /hab/svc/postgresql/data/pgdata
>>> 8.9G /hab/svc/postgresql/data/archive
>>> 9.1G /hab/svc/postgresql/data/
>>> # du -hs /hab/svc/postgresql/data/pgdata/pg_xlog/
>>> 177M /hab/svc/postgresql/data/pgdata/pg_xlog/
>>> # ls -lah /hab/svc/postgresql/data/archive/ | wc -l
>>> 571
>>>
>>> There is no pg_wal directory though (should there be?)
>>>
>>> # find /hab/svc/postgresql/ -name '*pg*wal*'
>>> #
>>>
>>> >> If the only reason you want an archive is for replication, then use
>>> streaming replication and do away with the archive completely
>>>
>>> To be honest, I thought it was required for streaming replication based
>>> on the guides linked above.
>>>
>>> >> There are reasons other than replication that one might want to keep
>>> a WAL archive, but those reasons don't seem to apply to you
>>>
>>> Like backup maybe? A wholly other topic, we recently had a power outage
>>> and I lost a pgsql node... having an archive would allow me to "replay" any
>>> transactions?
>>>
>>> >> What needs to be determined here is why the standby never consumed
>>> the WAL's from the master?
>>> Ok, so it the standby that's the problem.
>>>
>>> >> Do you still have the logs from the standby and do they show anything
>>> relevant?
>>>
>>> Sure, what am I looking for? I see a bunch of entries like:
>>>
>>> 2019-03-08 17:06:11 GMT [1813]: [815-1] user=,db=,client=
>>> (0:00000)LOG: restartpoint complete: wrote 22 buffers (0.0%); 0
>>> transaction log file(s) added, 0 removed, 1 recycled; write=2.211 s,
>>> sync=0.062 s, total=2.281 s; sync files=18, longest=0.062 s, average=0.003
>>> s; distance=16383 kB, estimate=16383 kB
>>> 2019-03-08 17:06:11 GMT [1813]: [816-1] user=,db=,client=
>>> (0:00000)LOG: recovery restart point at 0/8D000028
>>>
>>> On the 15th, around when I think I filled the disk, I see a bunch of:
>>>
>>> cp: cannot stat '/00000002.history': No such file or directory
>>> cp: cannot stat '/000000010000000400000049': No such file or directory
>>> 2019-03-15 23:59:49 GMT [16691]: [1-1] user=,db=,client=
>>> (0:XX000)FATAL: could not connect to the primary server: could not connect
>>> to server: Connection refused
>>> Is the server running on host "172.16.10.23" and
>>> accepting
>>> TCP/IP connections on port 5432?
>>>
>>> Which makes sense since the pgsql service was down.
>>>
>>> This appears to be when I recovered the master on Thursday:
>>>
>>> cp: cannot stat '/00000002.history': No such file or directory
>>> cp: cannot stat '/00000001000000040000004D': No such file or directory
>>> 2019-03-21 17:37:31 GMT [31338]: [1-1] user=,db=,client=
>>> (0:XX000)FATAL: could not connect to the primary server: could not connect
>>> to server: Connection refused
>>> Is the server running on host "172.16.10.23" and
>>> accepting
>>> TCP/IP connections on port 5432?
>>>
>>> cp: cannot stat '/00000002.history': No such file or directory
>>> cp: cannot stat '/00000001000000040000004D': No such file or directory
>>> 2019-03-21 17:37:36 GMT [31343]: [1-1] user=,db=,client= (0:00000)LOG:
>>> started streaming WAL from primary at 4/4D000000 on timeline 1
>>> 2019-03-21 17:37:47 GMT [30711]: [3151-1] user=,db=,client=
>>> (0:00000)LOG: restartpoint starting: time
>>> 2019-03-21 17:37:47 GMT [30711]: [3152-1] user=,db=,client=
>>> (0:00000)LOG: restartpoint complete: wrote 0 buffers (0.0%); 0 transaction
>>> log file(s) added, 0 removed, 1 recycled; write=0.003 s, sync=0.000 s,
>>> total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s;
>>> distance=16384 kB, estimate=16384 kB
>>>
>>> Then there's a bunch of the same entries where about the only thing
>>> (other than the timestamp) that seems to change is the index in:
>>>
>>> 2019-03-23 23:33:40 GMT [30711]: [5094-1] user=,db=,client=
>>> (0:00000)LOG: recovery restart point at 5/90000140
>>> 2019-03-23 23:38:40 GMT [30711]: [5097-1] user=,db=,client=
>>> (0:00000)LOG: recovery restart point at 5/91000028
>>> 2019-03-23 23:43:40 GMT [30711]: [5100-1] user=,db=,client=
>>> (0:00000)LOG: recovery restart point at 5/91000108
>>>
>>> I see effectively the same logs on the other hotstandby pgsql node.
>>>
>>> This is my config on my standby node:
>>>
>>> listen_addresses = '0.0.0.0'
>>> port = 5432
>>> max_connections = 100
>>> external_pid_file = '/hab/svc/postgresql/var/postgresql.pid'
>>> authentication_timeout = 1min
>>> max_files_per_process = 1000
>>> max_locks_per_transaction = 64
>>> logging_collector = on
>>> log_directory = '/hab/svc/postgresql/var/pg_log'
>>> log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
>>> log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,client=%h %r (%x:%e)'
>>> log_min_messages = ERROR
>>> datestyle = 'iso, mdy'
>>> default_text_search_config = 'pg_catalog.english'
>>> data_directory = '/hab/svc/postgresql/data/pgdata'
>>> hba_file = '/hab/svc/postgresql/config/pg_hba.conf'
>>> wal_level = hot_standby
>>> wal_log_hints = 'on'
>>> hot_standby = 'on'
>>> hot_standby_feedback = true
>>> max_wal_senders = 5
>>> max_replication_slots = 5
>>> checkpoint_completion_target = 0.9
>>> max_wal_size = 1GB
>>> min_wal_size = 128MB
>>> wal_keep_segments = 8
>>> log_checkpoints = on
>>> log_lock_waits = on
>>> log_temp_files = 0
>>> log_autovacuum_min_duration = 0
>>> track_activity_query_size = 2048
>>> track_io_timing=on
>>> dynamic_shared_memory_type = 'none'
>>> archive_mode = 'on'
>>> archive_command = 'cp %p /hab/svc/postgresql/data/archive/%f'
>>> archive_timeout = '10min'
>>> max_standby_archive_delay = '30s'
>>> synchronous_commit = local
>>> include '/hab/svc/postgresql/config/postgresql.local.conf'
>>>
>>> Should I disable archive mode? Even though I'm not currently using it,
>>> it seems like there's a use-case for having it? And if I can configure out
>>> what causing the backup of archive files it should still be manageable?
>>>
>>> Thanks again for all the replies, while it hasn't solved the problem
>>> yet, this was incredibly helpful! Also, please don't hate me for munging
>>> all your replies into one reply... I thought it might be easier to follow
>>> than having three different branches...?
>>>
>>> Best Regards,
>>> - QBR
>>>
>>> On Fri, Mar 22, 2019 at 7:14 PM Michael Paquier <michael(at)paquier(dot)xyz>
>>> wrote:
>>>
>>>> On Fri, Mar 22, 2019 at 12:26:33PM -0400, Jeff Janes wrote:
>>>> > archive_cleanup_command is pretty much obsolete. The modern way to
>>>> do this
>>>> > is with streaming replication, using either replication slots or
>>>> > wal_keep_segments. If the only reason you want an archive is for
>>>> > replication, then use streaming replication and do away with the
>>>> archive
>>>> > completely. There are reasons other than replication that one might
>>>> want
>>>> > to keep a WAL archive, but those reasons don't seem to apply to you.
>>>> And
>>>> > if they did you almost certainly wouldn't want to run
>>>> > archive_cleanup_command on it.
>>>>
>>>> Personally, I still find archives also very valuable when a standby
>>>> creation takes a long time because of a large initial base backup and
>>>> that the partition dedicated to pg_wal is not large enough to support
>>>> the retention associated with a slot, and it is easier to have larger
>>>> retention policies in the archives.
>>>> --
>>>> Michael
>>>>
>>>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Michel Pelletier 2019-04-01 20:31:51 Re: Help with insert query
Previous Message Adrian Klaver 2019-04-01 19:45:38 Re: Help with insert query