Performance issue in pg_dump's dependency loop searching

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Joe Van Dyk <joe(at)tanga(dot)com>
Subject: Performance issue in pg_dump's dependency loop searching
Date: 2014-07-25 19:47:37
Message-ID: 27628.1406317657@sss.pgh.pa.us
Lists: pgsql-hackers

I looked into the performance issue Joe Van Dyk reported in bug #11033.
It turns out this is basically a consequence of the "section boundary"
pseudo-objects I added in commit a1ef01fe163b304760088e3e30eb22036910a495.
(I'd worried about performance consequences at the time, but hadn't
seen any actual evidence that there would be a problem.)

In Joe's test case, there are 8119 objects that need to be sorted into a
dumpable order. (Only about 2700 of these are user objects that actually
end up in the output file; the rest are system objects like collations and
operators, which we include in the sorting pass in case there are
dependency paths leading through them.) 5976 of these are "pre-data
objects", so we end up with 5976 dependencies for the pre-data boundary
object; and then there are 386 data objects, which depend on the pre-data
boundary object and which the post-data boundary object depends on; and
then 1755 post-data objects, which all have dependencies on the post-data
boundary object.

You can probably already see where this is going :-(. For each post-data
object that's potentially involved in a dependency loop, the findLoop()
search in pg_dump_sort.c will need to recurse not only to the object's
ordinary dependencies (of which there are probably just a few), but to the
post-data boundary object. From there it will need to recurse to each
data object. And from each data object, it will recurse to the
pre-data boundary object. And from there, it'll have to recurse to each
and every pre-data object. Not to mention recursing to the actual
dependencies of each one. Now the recursion is pretty quick, but still
this is a significant number of cycles.

But wait, it gets worse.

In a data-only dump, we have even more dependencies, because we sort all
these objects (even though we won't dump most of them), and we also add
dependencies representing foreign-key constraints, in hopes of arriving at
a dump order wherein the foreign-key constraints won't be violated during
the restore. Joe's example has about 320 FK constraints, some of them in
fairly long chains (ie, table A references table B which references table
C which references table D etc). What this means is that there are a
multitude of dependency paths from the post-data boundary to the pre-data
boundary, for instance from boundary to table A to boundary, or from
boundary to table A to table B to boundary, or from boundary to table A to
table B to table C to boundary, etc etc etc. After chasing down *each* of
these paths, findLoop has to visit every one of the pre-data objects.

So what's surprising is not so much that the sort takes 30 seconds as that
it finishes before the heat death of the universe.

I've found a fairly simple fix for this, which depends on the observation
that once we've failed to find a dependency path from object X back to our
starting object, there's no need to search again if we arrive at X via a
different dependency path. This is noninvasive and obviously correct, and
it seems to completely eliminate the performance issue in Joe's test case.

There are other things that we might consider doing, like excluding system
objects from the dependency sort; but I'm less excited about that because
it's not entirely clear that there wouldn't be unexpected behavioral
consequences. In any case, the more user objects are in the database, the
less it would matter. We've definitely heard of databases with tens of
thousands of user tables for instance. Another idea would be to somehow
be smarter about quickly eliminating objects that aren't themselves
members of any dependency loop but merely depend on objects that are.
But that seems like a research project; I definitely don't see a cheap
way to get TopoSort to detect such cases.

So I propose applying the attached, and back-patching to 9.1, which is
as far back as the section-boundary objects are used.

regards, tom lane

Attachment Content-Type Size
avoid-repetitive-dependency-searches.patch text/x-diff 5.3 KB
