Re: Improvements in pg_dump/pg_restore toc format and performances

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Pierre Ducroquet <p(dot)psql(at)pinaraf(dot)info>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improvements in pg_dump/pg_restore toc format and performances
Date: 2023-09-18 21:52:47
Message-ID: 20230918215247.GA2661288@nathanxps13
Lists: pgsql-hackers

On Thu, Jul 27, 2023 at 10:51:11AM +0200, Pierre Ducroquet wrote:
> I ended up writing several patches that shaved some time for pg_restore -l,
> and reduced the toc.dat size.

I've only just started taking a look at these patches, and I intend to do a
more thorough review in the hopefully-not-too-distant future.

> First patch is "finishing" the job of removing has oids support. When this
> support was removed, instead of dropping the field from the dumps and
> increasing the dump versions, the field was kept as is. This field stores a
> boolean as a string, "true" or "false". This is not free, and requires 10
> bytes per toc entry.

This sounds reasonable to me. I wonder why this wasn't done when WITH OIDS
was removed in v12.

> The second patch removes calls to sscanf and replaces them with strtoul. This
> was the biggest speedup for pg_restore -l.

Nice.
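
For illustration only, the kind of replacement described above might look like the sketch below. This is not the code from the patch: the helper names parse_uint_sscanf and parse_uint_strtoul are hypothetical, and the strict validation (digits only, no trailing characters) is an assumption about what a TOC reader would want.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse an unsigned decimal with sscanf.
 * sscanf has to interpret the format string on every call, and "%u"
 * silently ignores any trailing characters after the number. */
static int
parse_uint_sscanf(const char *s, unsigned int *out)
{
    assert(s != NULL && out != NULL);
    return sscanf(s, "%u", out) == 1;
}

/* Hypothetical helper: the same job with strtoul, which skips the
 * format-string interpretation.  Returns 1 on success, 0 on failure. */
static int
parse_uint_strtoul(const char *s, unsigned int *out)
{
    char       *end;
    unsigned long v;

    assert(s != NULL && out != NULL);

    /* strtoul accepts leading whitespace and a sign; reject both here. */
    if (!isdigit((unsigned char) s[0]))
        return 0;

    errno = 0;
    v = strtoul(s, &end, 10);
    if (*end != '\0' || errno == ERANGE || v > UINT_MAX)
        return 0;

    *out = (unsigned int) v;
    return 1;
}
```

The strtoul variant is stricter than the sscanf one, so a real replacement would need to confirm that no existing archive relies on the looser parsing.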

> The third patch changes the dump format further to remove these strtoul calls
> and store the integers as is instead.

Do we need to worry about endianness here?

> The fourth patch is dirtier and does more changes to the dump format. Instead
> of storing the owner, tablespace, table access method and schema of each
> object as a string, pg_dump builds an array of these, stores them at the
> beginning of the file and replaces the strings with integer fields in the dump.
> This reduces the file size further, and removes a lot of calls to ReadStr, thus
> saving quite some time.

This sounds promising.
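
As a rough sketch of the idea in the quoted paragraph, repeated strings such as owner or schema names can be interned into a table, with each TOC entry then carrying only an integer index. The StringTable type and strtab_intern function below are hypothetical and are not from the patch; the linear lookup is chosen for brevity, and a real implementation would more likely use a hash table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table of distinct strings; zero-initialize before use. */
typedef struct StringTable
{
    char      **items;
    size_t      count;
    size_t      cap;
} StringTable;

/* Return the index of s in the table, adding a copy if it is not
 * already present.  Returns -1 on allocation failure. */
static int
strtab_intern(StringTable *t, const char *s)
{
    char       *copy;

    assert(t != NULL && s != NULL);

    /* Linear scan: fine for a sketch, too slow for large tables. */
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->items[i], s) == 0)
            return (int) i;

    /* Grow the array geometrically when it is full. */
    if (t->count == t->cap)
    {
        size_t      ncap = t->cap ? t->cap * 2 : 8;
        char      **n = realloc(t->items, ncap * sizeof(char *));

        if (n == NULL)
            return -1;
        t->items = n;
        t->cap = ncap;
    }

    copy = malloc(strlen(s) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, s);
    t->items[t->count] = copy;
    return (int) t->count++;
}
```

With this layout the table is written once at the start of the file, and a reader resolves each index with a single array lookup instead of reading a string per field.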

> Patch             Toc size  Dump -s duration  pg_restore -l duration
> HEAD              214M      23.1s             1.27s
> #1 (has oid)      210M      22.9s             1.26s
> #2 (scanf)        210M      22.9s             1.07s
> #3 (no strtoul)   202M      22.8s             0.94s
> #4 (string list)  181M      23.1s             0.87s

At a glance, the size improvements in 0004 look the most interesting to me.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
