From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Experimenting with hash tables inside pg_dump
Date: 2021-10-22 05:59:39
Message-ID: 20211022055939.z6fihsm7hdzbjttf@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2021-10-21 22:13:22 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > I wonder though if for some of them we should instead replace the per-object
> > queries with one query returning the information for all objects of a type. It
> > doesn't make all that much sense that we build and send one query for each
> > table and index.
>
> The trick is the problem I alluded to in another thread: it's not safe to
> do stuff like pg_get_expr() on tables we don't have lock on.
I was looking at getTableAttrs() - sending one query instead of one query per
table yields a quite substantial speedup in a quick prototype. And I don't
think it changes anything around locking semantics.
> I've thought about doing something like
>
> SELECT unsafe-functions FROM pg_class WHERE oid IN (someoid, someoid, ...)
>
> but in cases with tens of thousands of tables, it seems unlikely that
> that's going to behave all that nicely.
That's kinda what I'm doing in the quick hack. But instead of using IN(...) I
made it unnest('{oid, oid, ...}'), which scales much better.
A pg_dump --schema-only of the regression database goes from

    real    0m0.675s
    user    0m0.039s
    sys     0m0.029s

to

    real    0m0.477s
    user    0m0.037s
    sys     0m0.020s

which isn't half-bad.
There are a few more cases like this, I think. But most are harder because the
dumping happens one-by-one from dumpDumpableObject(). The relatively easy but
substantial cases I could find quickly were getIndexes(), getConstraints(), and
getTriggers().
To see where it's worth putting in time it'd be useful if getSchemaData() in
verbose mode printed timing information...
> The *real* fix, I suppose, would be to fix all those catalog-inspection
> functions so that they operate with respect to the query's snapshot.
> But that's not a job I'm volunteering for. Besides which, pg_dump
> still has to cope with back-rev servers where it wouldn't be safe.
Yea, that's not a small change :(. I suspect that we'd need a bunch of new
caching infrastructure to make that reasonably performant, since this
presumably couldn't use syscache etc.
Greetings,
Andres Freund
Attachment: pg_dump-bulk-gettableattrs.diff (text/x-diff, 25.1 KB)