Re: parallel pg_restore blocks on heavy random read I/O on all children processes

From:	Hannu Krosing <hannuk(at)google(dot)com>
To:	Dimitrios Apostolou <jimis(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-performance(at)lists(dot)postgresql(dot)org
Subject:	Re: parallel pg_restore blocks on heavy random read I/O on all children processes
Date:	2025-04-10 06:50:33
Message-ID:	CAMT0RQTz7Zi99C66U2160Mmcxj+fBJ0OpD6Eq=aLZsdYaFZwBg@mail.gmail.com
Lists:	pgsql-performance

You may be interested in the patch "Adding pg_dump flag for parallel
export to pipes"[1], which allows using pipes in directory-format
parallel dump and restore.
There the offsets are implicitly taken care of by the file system.

[1] https://www.postgresql.org/message-id/CAH5HC97p4kkpikar%2BswuC0Lx4YTVkE30sTsFX94tyzih7Cc_%3Dw%40mail.gmail.com
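
To make the offset problem from the quoted message below concrete, here is
a minimal Python sketch. The archive layout, the names write_archive and
read_entry, and the header fields are all invented for illustration; this is
not pg_dump's real custom format. It only shows the general idea: a writer
that cannot seek back leaves the TOC offsets unset, and a reader then has to
hop from block header to block header to find an entry.

```python
import io
import struct

# Toy layout invented for this sketch; it is NOT pg_dump's real archive format.
OFFSET = struct.Struct("<q")   # one TOC slot per entry; -1 means "offset unknown"
HEADER = struct.Struct("<II")  # per-block header: (entry id, payload length)


def write_archive(payloads, seekable):
    """Write a TOC of offset slots, then the data blocks.

    The offsets are back-filled only when the output is treated as seekable.
    """
    out = io.BytesIO()
    out.write(OFFSET.pack(-1) * len(payloads))  # sizes are not known yet
    offsets = []
    for entry_id, payload in enumerate(payloads):
        offsets.append(out.tell())
        out.write(HEADER.pack(entry_id, len(payload)))
        out.write(payload)
    if seekable:
        out.seek(0)  # go back and fill in the real offsets
        for offset in offsets:
            out.write(OFFSET.pack(offset))
    return out.getvalue()


def read_entry(archive, n_entries, wanted):
    """Return (payload, number of block headers read) for one entry."""
    buf = io.BytesIO(archive)
    buf.seek(wanted * OFFSET.size)
    (offset,) = OFFSET.unpack(buf.read(OFFSET.size))
    # With a known offset we jump straight to the block; otherwise we must
    # start at the first block and hop from header to header.
    buf.seek(offset if offset >= 0 else n_entries * OFFSET.size)
    headers_read = 0
    while True:
        entry_id, length = HEADER.unpack(buf.read(HEADER.size))
        headers_read += 1
        if entry_id == wanted:
            return buf.read(length), headers_read
        buf.seek(length, io.SEEK_CUR)  # skip the payload we do not want


payloads = [b"a" * 10, b"b" * 20, b"c" * 30]
with_offsets = write_archive(payloads, seekable=True)
without_offsets = write_archive(payloads, seekable=False)

_, direct_reads = read_entry(with_offsets, len(payloads), wanted=2)
_, scan_reads = read_entry(without_offsets, len(payloads), wanted=2)
print(direct_reads, scan_reads)  # prints "1 3"
```

In this toy model the cost of the scan grows with the number of blocks that
precede the wanted entry, and every parallel reader pays it separately.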

On Sun, Mar 23, 2025 at 4:46 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Dimitrios Apostolou <jimis(at)gmx(dot)net> writes:
> > On Thu, 20 Mar 2025, Tom Lane wrote:
> >> I am betting that the problem is that the dump's TOC (table of
> >> contents) lacks offsets to the actual data of the database objects,
> >> and thus the readers have to reconstruct that information by scanning
> >> the dump file. Normally, pg_dump will back-fill offset data in the
> >> TOC at completion of the dump, but if it's told to write to an
> >> un-seekable output file then it cannot do that.
>
> > Further questions:
>
> > * Does the same happen in an uncompressed dump? Or maybe the offsets are
> > pre-filled because they are predictable without compression?
>
> Yes; no. We don't know the size of a table's data as-dumped until
> we've dumped it.
>
> > * Should pg_dump print some warning for generating a lower quality format?
>
> I don't think so. In many use-cases this is irrelevant and the
> warning would just be an annoyance.
>
> > * The seeking pattern in pg_restore seems non-sensical to me: reading 4K,
> > jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
> > hour. /Maybe/ something to improve there... Where can I read more about
> > the format?
>
> It's reading data blocks (or at least the headers thereof), which have
> a limited size. I don't think that size has changed since circa 1999,
> so maybe we could consider increasing it; but I doubt we could move
> the needle very far that way.
>
> > * Why doesn't it happen in single-process pg_restore?
>
> A single-process restore is going to restore all the data in the order
> it appears in the archive file, so no seeking is required. Of course,
> as soon as you ask for parallelism, that doesn't work too well.
>
> Hypothetically, maybe the algorithm for handing out tables-to-restore
> to parallel workers could pay attention to the distance to the data
> ... except that in the problematic case we don't have that
> information. I don't recall for sure, but I think that the order of
> the TOC entries is not necessarily a usable proxy for the order of the
> data entries. It's unclear to me that overriding the existing
> heuristic (biggest tables first, I think) would be a win anyway.
>
> regards, tom lane
>
>
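
As a rough illustration of the ordering point in the quoted message, the
Python sketch below compares how far a file position travels when entries
are read in archive order versus biggest-first. The table names, sizes, and
the seek_distance helper are hypothetical, and modelling a single file
position is a simplification: real parallel workers each hold their own
position, and the actual pg_restore scheduling may differ from this.

```python
def seek_distance(order, offsets, sizes):
    """Total distance one file position travels when reading entries in `order`."""
    position = 0
    travelled = 0
    for name in order:
        travelled += abs(offsets[name] - position)
        position = offsets[name] + sizes[name]  # reading leaves us at the block's end
    return travelled


# Hypothetical tables, laid out back to back in archive order.
sizes = {"a": 100, "b": 900, "c": 300, "d": 50}
archive_order = ["a", "b", "c", "d"]

offsets = {}
position = 0
for name in archive_order:
    offsets[name] = position
    position += sizes[name]

# "Biggest tables first", the heuristic mentioned in the quoted message.
biggest_first = sorted(archive_order, key=lambda n: sizes[n], reverse=True)

sequential_cost = seek_distance(archive_order, offsets, sizes)
biggest_first_cost = seek_distance(biggest_first, offsets, sizes)
print(sequential_cost, biggest_first_cost)  # prints "0 2600"
```

Reading in archive order never has to jump, which matches the
single-process case. Any other order does, and a distance-aware scheduler
would need the very offsets that are missing in the problematic dumps.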

In response to

Re: parallel pg_restore blocks on heavy random read I/O on all children processes at 2025-03-23 15:46:42 from Tom Lane
