Re: filesystem full during vacuum - space recovery issues

From: Thomas Simpson <ts(at)talentstack(dot)to>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: filesystem full during vacuum - space recovery issues
Date: 2024-07-17 13:24:43
Message-ID: 38b20a9f-ef8a-486a-bb7d-f7a7be20f98d@talentstack.to
Lists: pgsql-admin pgsql-hackers

Thanks Laurenz & Imran for your comments.

My responses inline below.

Thanks

Tom

On 15-Jul-2024 20:58, Laurenz Albe wrote:
> On Mon, 2024-07-15 at 14:47 -0400, Thomas Simpson wrote:
>> I have a large database (multi TB) which had a vacuum full running but the database
>> ran out of space during the rebuild of one of the large data tables.
>>
>> Cleaning down the WAL files got the database restarted (an archiving problem led to
>> the initial disk full).
>>
>> However, the disk space is still at 99% as it appears the large table rebuild files
>> are still hanging around using space and have not been deleted.
>>
>> My problem now is how do I get this space back to return my free space back to where
>> it should be?
>>
>> I tried some scripts to map the data files to relations but this didn't work as
>> removing some files led to startup failure despite them appearing to be unrelated
>> to anything in the database - I had to put them back and then startup worked.
>>
>> Any suggestions here?
> That reads like the sad old story: "cleaning down" WAL files - you mean deleting the
> very files that would have enabled PostgreSQL to recover from the crash that was
> caused by the full file system.
>
> Did you run "pg_resetwal"? If yes, that probably led to data corruption.

No, I just removed the excess already-archived WALs to free space and
restarted.  The vacuum full that was running had created files for the
large table it was processing, and these are still hanging around eating
space without doing anything useful.  The unclean shutdown prevented the
rollback from removing them, which seems to be the core problem.
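For reference, leftover relfilenodes from an aborted VACUUM FULL can usually be spotted by comparing the files in the database directory against the catalogs. A rough sketch (it assumes PostgreSQL 9.4+ for pg_filenode_relation, and pg_ls_dir needs superuser or pg_read_server_files rights); treat any hit as a candidate only, and verify against a fresh backup before touching anything on disk:

```sql
-- Sketch: list numeric files in the current database's directory that no
-- relation currently claims.  CAUTION: candidates only, do not delete blindly.
SELECT f.filename
FROM (
  SELECT x AS filename
  FROM pg_ls_dir(
         'base/' || (SELECT oid
                     FROM pg_database
                     WHERE datname = current_database())::text
       ) AS x
  WHERE x ~ '^[0-9]+$'      -- skip fork/segment suffixes like 12345_fsm, 12345.1
) f
WHERE pg_filenode_relation(0, f.filename::oid) IS NULL;  -- 0 = default tablespace
```

Run it with the cluster otherwise idle; files belonging to in-flight transactions will show up as false positives, and mapped relations store relfilenode 0 in pg_class, which is why pg_filenode_relation is safer than joining on pg_class.relfilenode directly. This kind of script is exactly what can go wrong the way described above, hence the caution.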

> The above are just guesses. Anyway, there is no good way to get rid of the files
> that were left behind after the crash. The reliable way of doing so is also the way
> to get rid of potential data corruption caused by "cleaning down" the database:
> pg_dump the whole thing and restore the dump to a new, clean cluster.
>
> Yes, that will be a painfully long down time. An alternative is to restore a backup
> taken before the crash.

My issue now is that the dump & reload is taking a huge amount of time;
I know the hardware is capable of multi-GB/s throughput, but the reload
is projected to take about 10 days at the current rate (about 30 Mb/sec).
The old and new servers have a 10G link between them and storage is
SSD-backed, so the hardware is capable of much, much more than it is
doing now.

Is there a way to improve the reload performance?  I'm open to tuning of
any type, even if I need to undo it later once the reload is done.
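On the reload side, the usual levers are parallelism and temporarily relaxing durability on the target cluster. A sketch of the kind of thing that can cut a restore from days to hours (host names, paths, and job counts below are placeholders, not details from this thread):

```shell
#!/bin/sh
# Run on the NEW server so the data crosses the 10G link only once.
JOBS=8                                  # roughly one job per CPU core

# 1. Temporarily relax durability on the new cluster (revert afterwards!).
psql -U postgres <<'EOF'
ALTER SYSTEM SET synchronous_commit = off;
ALTER SYSTEM SET maintenance_work_mem = '2GB';   -- faster index builds
ALTER SYSTEM SET max_wal_size = '64GB';          -- fewer checkpoints mid-load
ALTER SYSTEM SET autovacuum = off;               -- re-enable and ANALYZE after
SELECT pg_reload_conf();
EOF

# 2. Parallel dump in directory format (-Fd is required for -j).
pg_dump -h old-server -U postgres -Fd -j "$JOBS" -f /fastdisk/dumpdir mydb

# 3. Parallel restore into the new cluster.
createdb -U postgres mydb
pg_restore -U postgres -d mydb -j "$JOBS" /fastdisk/dumpdir
```

One caveat: pg_restore -j parallelizes across tables and indexes, so a single multi-TB table still serializes on its own COPY; if that one table dominates, splitting its load is the next thing to look at. Setting fsync = off buys more speed but means restarting the whole load after any crash, so it is only worth it if that trade is acceptable.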

My backups were in progress when all the issues happened, so they're not
a good starting point, and I'd actually prefer the clean reload: this DB
has been through multiple upgrades (without reloads) until now, so I
know it's not especially clean.  The size has always prevented a full
reload before, but the database is relatively low-traffic now, so I can
afford some time to reload, though ideally not 10 days.

> Yours,
> Laurenz Albe
