Re: Enhance pg_dump multi-threaded streaming (WAS: Re: filesystem full during vacuum - space recovery issues)

From: Thomas Simpson <ts(at)talentstack(dot)to>
To: Scott Ribe <scott_ribe(at)elevated-dev(dot)com>
Cc: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Enhance pg_dump multi-threaded streaming (WAS: Re: filesystem full during vacuum - space recovery issues)
Date: 2024-07-19 20:23:29
Message-ID: 42e5a766-2af7-4e2b-b82d-373da77c35ed@talentstack.to
Lists: pgsql-admin pgsql-hackers

Hi Scott

On 19-Jul-2024 15:34, Scott Ribe wrote:
>> On Jul 19, 2024, at 7:46 AM, Thomas Simpson <ts(at)talentstack(dot)to> wrote:
>>
>> I realize some of the background was snipped on what I sent to the hacker list, I'll try to fill in the details.
> I was gone from my computer for a day and lost track of the thread.
>
> Perhaps logical replication could help you out here?

I'm not sure - perhaps, but at this point I've got that dump/reload
running, and provided it completes OK (in about 20 days' time at the
current rate), I'll be fine with this.

The database itself is essentially an archive and is no longer being
added to, so the rebuild time is an annoyance rather than a disaster.

[But incidentally, I am working on an even larger project which is
likely to make this one seem small, so improvement around large
databases is important to me.]

However, my thought is around how to avoid this issue in the future and
to improve the experience for others faced with the dump-reload that is
always the fall-back upgrade suggestion between major versions.

Getting parallelism should be possible; the current pg_dump already
does this for directory-format output, spawning multiple workers.
Judging by parallel.c in pg_dump, most of where my thought process was
going is actually already there.
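
For reference, the existing parallel path looks something like this
(database names and paths here are just placeholders):

    # parallel dump to a directory-format archive with 8 workers
    pg_dump -Fd -j 8 -f /backup/mydb.dir sourcedb

    # parallel restore from that archive, also with 8 workers
    pg_restore -j 8 -d targetdb /backup/mydb.dir

The catch is that the directory format is the only one pg_dump can
write in parallel, and it has to land on a filesystem first.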

The extension would be adding synchronization/checkpointing between the
generating dump and the receiving reload, to ensure objects are not
processed until everything they depend on is already present in the new
database.  This is all based around routing via network streams instead
of the filesystem, as happens currently.
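
Today, as far as I can tell, streaming directly between servers without
touching the filesystem limits you to a single stream, e.g.:

    # single-stream dump/reload over the network - no -j possible here,
    # since pg_restore cannot parallelize from a non-seekable pipe
    pg_dump -Fc sourcedb | pg_restore -d "host=newserver dbname=targetdb"

which is exactly what makes very large databases so painful.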

Perhaps this is already in place - since restore can be done in
parallel, it must implement that dependency ordering already?  If
someone with a good understanding of pg_dump is able to comment or give
suggestions, I'm not against making a first attempt at implementing
something.
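
As a crude illustration of the kind of thing I mean (an untested
sketch; it assumes the source isn't changing, since each pg_dump run
takes its own snapshot, and that the bulk of the data is in ordinary
tables in the public schema):

    # 1. pre-data section first (tables, types), so targets exist
    pg_dump --section=pre-data sourcedb | psql -d targetdb

    # 2. one data-only stream per table, run concurrently
    for t in $(psql -Atc "select quote_ident(schemaname) || '.' ||
                          quote_ident(tablename) from pg_tables
                          where schemaname = 'public'" sourcedb); do
        pg_dump -a -t "$t" sourcedb | psql -d targetdb &
    done
    wait

    # 3. post-data (indexes, constraints) once all the data is in
    pg_dump --section=post-data sourcedb | psql -d targetdb

A built-in version would need proper dependency tracking rather than
this coarse three-phase split, which is where the existing parallel
restore machinery ought to help.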

From git blame, I see Tom Lane did a bunch of work around parallel dump
back in 2020 - perhaps he could make suggestions, either by direct
email or on the list?

Thanks

Tom
