Speeding up ruleutils' name de-duplication code, redux

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Speeding up ruleutils' name de-duplication code, redux
Date: 2024-07-29 22:14:10
Message-ID: 2885468.1722291250@sss.pgh.pa.us
Lists: pgsql-hackers

When deparsing queries or expressions, ruleutils.c has to generate
unique names for RTEs and columns of RTEs. (Often, they're unique
already, but this isn't reliably true.) The original logic for that
involved just strcmp'ing a proposed name against all the ones already
assigned, which obviously is O(N^2) in the number of names being
considered. Back in commit 8004953b5, we fixed that problem for
generation of unique RTE names, by using a hash table to remember the
already-assigned names. However, that commit did not touch the logic
for de-duplicating the column names within each RTE, explaining:

In principle the same problem applies to the column-name-de-duplication
code; but in practice that seems to be less of a problem, first because
N is limited since we don't support extremely wide tables, and second
because duplicate column names within an RTE are fairly rare, so that in
practice the cost is more like O(N^2) not O(N^3). It would be very much
messier to fix the column-name code, so for now I've left that alone.

But I think the time has come to do something about it. In [1]
I presented this Perl script to generate a database that gives
pg_upgrade indigestion:

-----
for (my $i = 0; $i < 100; $i++)
{
    print "CREATE TABLE test_inh_check$i (\n";
    for (my $j = 0; $j < 1000; $j++)
    {
        print "a$j float check (a$j > 10.2),\n";
    }
    print "b float);\n";
    print "CREATE TABLE test_inh_check_child$i() INHERITS(test_inh_check$i);\n";
}
-----

On my development machine, it takes over 14 minutes to pg_upgrade
this, and it turns out that that time is largely spent in column
name de-duplication while deparsing the CHECK constraints. The
attached patch reduces that to about 3m45s.

(I think that we ought to reconsider MergeConstraintsIntoExisting's
use of deparsing to compare check constraints: it'd be faster and
probably more reliable to apply attnum translation to one parsetree
and then use equal(). But that's a matter for a different patch, and
this patch would still be useful for the pg_dump side of the problem.)

I was able to avoid a lot of the complexity I'd feared before by not
attempting to use hashing during set_using_names(), which only has to
consider columns merged by USING clauses, so it shouldn't have enough
of a performance problem to be worth touching. The hashing code needs
to be optional anyway because it's unlikely to be a win for narrow
tables, so we can simply ignore it until we reach the potentially
expensive steps. Also, things are already factored in such a way that
we only need to have one hashtable at a time, so this shouldn't cause
any large memory bloat.

I'll park this in the next CF.

regards, tom lane

[1] https://www.postgresql.org/message-id/2422717.1722201869%40sss.pgh.pa.us

Attachment Content-Type Size
v1-speed-up-column-name-deduplication.patch text/x-diff 9.9 KB
