From: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com> |
Subject: | Re: pg_walcleaner - new tool to detect, archive and delete the unneeded wal files (was Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary) |
Date: | 2022-04-22 13:37:16 |
Message-ID: | CALj2ACWqzkQO2n-NdOykP81yrepSeTU9t837sMMmUEHH5fnO9A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Apr 18, 2022 at 8:48 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>
> Greeting,
>
> * Bharath Rupireddy (bharath(dot)rupireddyforpostgres(at)gmail(dot)com) wrote:
> > On Mon, Apr 18, 2022 at 7:41 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > > * Bharath Rupireddy (bharath(dot)rupireddyforpostgres(at)gmail(dot)com) wrote:
> > > > Thanks for the comments. Here's a new tool called pg_walcleaner which
> > > > basically deletes (optionally archiving before deletion) the unneeded
> > > > WAL files.
> > > >
> > > > Please provide your thoughts and review the patches.
> > >
> > > Alright, I spent some more time thinking about this and contemplating
> > > what the next steps are... and I feel like the next step is basically
> > > "add a HINT when the server can't start due to being out of disk space
> > > that one should consider running pg_walcleaner" and at that point... why
> > > aren't we just, uh, doing that? This is all still quite hand-wavy, but
> > > it sure would be nice to be able to avoid downtime due to a broken
> > > archiving setup. pgbackrest has a way of doing this and while we, of
> > > course, discourage the use of that option, as it means throwing away
> > > WAL, it's an option that users have. PG could have a similar option.
> > > Basically, to archive_command/library what max_slot_wal_keep_size is for
> > > slots.
> >
> > Thanks. I get your point. The way I see it is that the postgres should
> > be self-aware of the about-to-get-full disk (probably when the data
> > directory size is 90%(configurable, of course) of total disk size) and
> > then freeze the new write operations (may be via new ALTER SYSTEM SET
> > READ-ONLY or setting default_transaction_read_only GUC) and then go
> > clean the unneeded WAL files by just invoking pg_walcleaner tool
> > perhaps. I think, so far, this kind of work has been done outside of
> > postgres. Even then, we might get into out-of-disk situations
> > depending on how frequently we check the data directory size to
> > compute the 90% configurable limit. Detecting the disk size is the KEY
> > here. Hence we need an offline invokable tool like pg_walcleaner.
>
> Ugh, last I checked, figuring out if a given filesystem is near being
> full is a pain to do in a cross-platform way. Why not just do exactly
> what we already are doing for replication slots, but for
> archive_command?
Do you mean to say that if the archvie_command fails, say, for "some
time" or "some number of attempts", just let the server not bother
about it and checkpoint delete the WAL files instead of going out of
disk? If this is the thought, then it's more dangerous as we might end
up losing the WAL forever. For invalidating replication slots, it's
okay because the required WAL can exist somewhere (either on the
primary or on the archive location).
> > > That isn't to say that we shouldn't also have a tool like this, but it
> > > generally feels like we're taking a reactive approach here rather than a
> > > proactive one to addressing the root issue.
> >
> > Agree. The offline tool like pg_walcleaner can help greatly even with
> > some sort of above internal/external disk space monitoring tools.
>
> See, this seems like a really bad idea to me. I'd be very concerned
> about people mis-using this tool in some way and automating its usage
> strikes me as absolutely exactly that.. Are we sure that we can
> guarantee that we don't remove things we shouldn't when this ends up
> getting run against a running cluster from someone's automated tooling?
> Or when someone sees that it refuses to run for $reason and tries to..
> "fix" that? Seems quite risky to me.. I'd probably want to put similar
> caveats around using this tool as I do around pg_resetwal when doing
> training- that is, don't ever, ever, ever use this, heh.
The initial version of the patch doesn't check if the server crashed
or not before running it. I was thinking of looking at the
postmaster.pid or pg_control file (however they don't guarantee
whether the server is up or crashed because the server can crash
without deleting postmaster.pid or updating pg_control file). Another
idea is to let pg_walcleaner fire a sample query ('SELECT 1') to see
if the server is up and running, if yes, exit, otherwise proceed with
its work.
Also, to not cause losing of WAL permanently, we must recommend using
archvie_command so that the WAL can be moved to an alternative
location (could be the same archvie_location that primary uses).
And yes, we must have clear usage guidelines in the docs.
Regards,
Bharath Rupireddy.
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2022-04-22 13:38:35 | Re: Dump/Restore of non-default PKs |
Previous Message | Andrew Dunstan | 2022-04-22 13:15:31 | Re: pgsql: Allow db.schema.table patterns, but complain about random garbag |