Re: parallel pg_restore blocks on heavy random read I/O on all children processes

From: Dimitrios Apostolou <jimis(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: parallel pg_restore blocks on heavy random read I/O on all children processes
Date: 2025-03-24 16:43:38
Message-ID: 3764bde7-ca09-a8a3-489c-0166f772245d@gmx.net
Lists: pgsql-performance

On Sun, 23 Mar 2025, Tom Lane wrote:

> Dimitrios Apostolou <jimis(at)gmx(dot)net> writes:
>> On Thu, 20 Mar 2025, Tom Lane wrote:
>>> I am betting that the problem is that the dump's TOC (table of
>>> contents) lacks offsets to the actual data of the database objects,
>>> and thus the readers have to reconstruct that information by scanning
>>> the dump file. Normally, pg_dump will back-fill offset data in the
>>> TOC at completion of the dump, but if it's told to write to an
>>> un-seekable output file then it cannot do that.
>
>> Further questions:
>
>> * The seeking pattern in pg_restore seems non-sensical to me: reading 4K,
>> jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
>> hour. /Maybe/ something to improve there... Where can I read more about
>> the format?
>
> It's reading data blocks (or at least the headers thereof), which have
> a limited size. I don't think that size has changed since circa 1999,
> so maybe we could consider increasing it; but I doubt we could move
> the needle very far that way.

If it could be increased substantially, it would cut the number of seeks by a
large factor and definitely make a difference. Would it have a negative impact
on the size of small backups, though?
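
Back-of-the-envelope, assuming the ~4K-read plus ~8-12K-skip stride I
described: the ~800GB compressed archive divided by a ~12-16K stride gives on
the order of 50-70 million header reads, which at ~15K IOPS is roughly an
hour -- about what I observed. With, say, 1MB blocks the same pass would need
well under a million reads.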

However, I would have expected the kernel to handle this read-then-seek
pattern better: the 4K read() should trigger a large pre-fetch (read-ahead)
that also covers the next seek and read. I'm not sure why I don't see that
happening; it's a rather typical setup (NVMe SSD with an ext4 filesystem on
Linux 6.11).
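
To make the pattern concrete, here is a stand-alone sketch (not pg_restore
code; the stride and the fadvise hint are my own assumptions) that reproduces
the 4K-read / ~12K-skip access and asks the kernel for a larger read-ahead
window via posix_fadvise():

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <archive-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /*
     * Hint that the file will be read (mostly) front to back, so the
     * kernel may use a larger read-ahead window even though the reads
     * are small and separated by short forward jumps.
     */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    char buf[4096];
    off_t off = 0;

    /* 4K read, then jump ~12K forward: roughly the pattern I described. */
    while (pread(fd, buf, sizeof(buf), off) == (ssize_t) sizeof(buf))
        off += 4096 + 12288;

    close(fd);
    return 0;
}

If the SEQUENTIAL hint makes a measurable difference here, maybe pg_restore
could issue the same hint (or explicit POSIX_FADV_WILLNEED prefetches) while
it scans for data blocks -- just a thought, not a claim about how it works
today.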

As a workaround, pg_restore could read() the whole archive sequentially to
reconstruct the missing offsets, instead of jumping around. I'm a bit split on
this one, as it would be much faster on the compressed archive (~800GB) but
not by as much on the uncompressed one (>5TB).
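
To be concrete, something like the following is what I have in mind -- purely
illustrative, with a made-up block header (the real custom-format layout
differs): read straight through once, record each block's offset, and skip
payloads by reading them rather than seeking over them, so the kernel sees one
long sequential stream:

#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <archive-file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f)
    {
        perror("fopen");
        return 1;
    }

    uint32_t dump_id;      /* made-up header field: owning TOC entry   */
    uint64_t length;       /* made-up header field: payload byte count */
    unsigned long long offset = 0;

    while (fread(&dump_id, sizeof(dump_id), 1, f) == 1 &&
           fread(&length, sizeof(length), 1, f) == 1)
    {
        printf("dump ID %u starts at offset %llu\n",
               (unsigned) dump_id, offset);

        /* Skip the payload by reading it, not by seeking over it. */
        char buf[1 << 16];
        uint64_t left = length;
        while (left > 0)
        {
            size_t chunk = left < sizeof(buf) ? (size_t) left : sizeof(buf);
            if (fread(buf, 1, chunk, f) != chunk)
            {
                fprintf(stderr, "truncated block\n");
                fclose(f);
                return 1;
            }
            left -= chunk;
        }
        offset += sizeof(dump_id) + sizeof(length) + length;
    }

    fclose(f);
    return 0;
}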

I'm also under the impression that all the pg_restore child processes are
doing the same thing: each one seeks and reads through the whole file on its
own. Can you confirm or correct that?

Thank you for the detailed response.

Dimitris
