Re: Speeding up ruleutils' name de-duplication code, redux

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Speeding up ruleutils' name de-duplication code, redux
Date: 2024-09-10 09:57:21
Message-ID: CAApHDvriAQ4W=2B-t7R+N77Bmq+HZLAf3ReQiyGXeZMke7yibQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, 30 Jul 2024 at 10:14, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> -----
> for (my $i = 0; $i < 100; $i++)
> {
>     print "CREATE TABLE test_inh_check$i (\n";
>     for (my $j = 0; $j < 1000; $j++)
>     {
>         print "a$j float check (a$j > 10.2),\n";
>     }
>     print "b float);\n";
>     print "CREATE TABLE test_inh_check_child$i() INHERITS(test_inh_check$i);\n";
> }
> -----
>
> On my development machine, it takes over 14 minutes to pg_upgrade
> this, and it turns out that that time is largely spent in column
> name de-duplication while deparsing the CHECK constraints. The
> attached patch reduces that to about 3m45s.

I think this is worth doing. Reducing the --schema-only time in
pg_dump is a worthy goal, since that translates directly into less
downtime during upgrades.
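
For anyone skimming the thread, the rough shape of the change is
something like the standalone toy sketch below. To be clear, this is
not the actual ruleutils.c code or the patch; every name in it
(names_hash, colname_is_unique(), NAMES_HASH_THRESHOLD) is invented
purely for illustration:

-----
/*
 * Toy sketch of the idea only: every name below is invented for
 * illustration and is not from ruleutils.c or the patch.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAMES_HASH_THRESHOLD 32     /* mirrors the patch's cut-off */

/* minimal chained hash set of strings */
typedef struct hash_entry
{
    const char *name;
    struct hash_entry *next;
} hash_entry;

typedef struct names_hash
{
    hash_entry **buckets;
    int         nbuckets;
} names_hash;

static unsigned int
hash_string(const char *s)
{
    unsigned int h = 5381;

    while (*s)
        h = (h << 5) + h + (unsigned char) *s++;
    return h;
}

static names_hash *
names_hash_create(int nentries)
{
    names_hash *h = malloc(sizeof(names_hash));

    h->nbuckets = nentries * 2 + 1;
    h->buckets = calloc(h->nbuckets, sizeof(hash_entry *));
    return h;
}

static bool
names_hash_contains(const names_hash *h, const char *name)
{
    for (hash_entry *e = h->buckets[hash_string(name) % h->nbuckets];
         e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return true;
    return false;
}

static void
names_hash_add(names_hash *h, const char *name)
{
    hash_entry *e = malloc(sizeof(hash_entry));
    unsigned int b = hash_string(name) % h->nbuckets;

    e->name = name;
    e->next = h->buckets[b];
    h->buckets[b] = e;
}

/*
 * Is "name" distinct from colnames[0..ncols-1]?  With no hash this is
 * the old linear scan; with one, a single probe replaces it.
 */
static bool
colname_is_unique(const char *name, char **colnames, int ncols,
                  const names_hash *hash)
{
    if (hash != NULL)
        return !names_hash_contains(hash, name);
    for (int i = 0; i < ncols; i++)
        if (strcmp(colnames[i], name) == 0)
            return false;
    return true;
}

int
main(void)
{
    char       *cols[] = {"a0", "a1", "a2"};
    int         ncols = 3;
    names_hash *h = NULL;

    /* only worth building the hash once there are enough columns */
    if (ncols >= NAMES_HASH_THRESHOLD)
    {
        h = names_hash_create(ncols);
        for (int i = 0; i < ncols; i++)
            names_hash_add(h, cols[i]);
    }
    printf("a1 unique? %d\n", colname_is_unique("a1", cols, ncols, h));
    printf("zz unique? %d\n", colname_is_unique("zz", cols, ncols, h));
    return 0;
}
-----

Below the cut-off the linear probe stays, and above it each uniqueness
check becomes a single hash lookup, which is what turns the O(N^2)
behaviour into something close to O(N).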

I looked at the patch and tried it out. I wondered about the choice of
32 as the cut-off point, so I decided to benchmark it using the
attached script. Here's an extract from the attached results:

Patched with 10 child tables
pg_dump for 16 columns real 0m0.068s
pg_dump for 31 columns real 0m0.080s
pg_dump for 32 columns real 0m0.083s

That's about what I'd expect to see. I wanted to make sure the point
where you switch to the hashing method is in roughly the right place,
and it seems to be, at least for my test.

The performance looks good too:

10 tables:
master: pg_dump for 1024 columns real 0m23.053s
patched: pg_dump for 1024 columns real 0m1.573s

100 tables:
master: pg_dump for 1024 columns real 3m29.857s
patched: pg_dump for 1024 columns real 0m23.053s

Perhaps you won't think it's worth the additional complexity, but I
see that in both locations where you call build_colinfo_names_hash(),
it's done just after a call to expand_colnames_array_to(). I wondered
if it would be worthwhile to unify those two functions, perhaps under
a new name, so that you don't need to loop over the always-NULL
elements of the colnames[] array when building the hash table. That's
likely quite a small overhead compared to the quadratic search you've
removed, so it might not move the needle at all. I just wanted to
point it out, as there's little else I can find to comment on.
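
To make that a bit more concrete, something like the sketch below is
what I had in mind. Again, it's only a rough standalone illustration
with simplified stand-ins (colinfo_sketch, names_hash_add(),
expand_colnames_array_and_hash()); it's not the real colinfo struct
from ruleutils.c nor the patch's hash table:

-----
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct colinfo_sketch
{
    int         num_cols;      /* allocated length of colnames[] */
    char      **colnames;      /* assigned names, or NULL when unset */
    void       *names_hash;    /* NULL, or an opaque hash-set handle */
} colinfo_sketch;

/* stand-in for inserting a name into whatever hash table the patch uses */
static void
names_hash_add(void *names_hash, const char *name)
{
    (void) names_hash;
    (void) name;               /* deliberately a no-op in this sketch */
}

static void
expand_colnames_array_and_hash(colinfo_sketch *colinfo, int n)
{
    int         old_num_cols = colinfo->num_cols;

    if (n > old_num_cols)
    {
        colinfo->colnames = realloc(colinfo->colnames, n * sizeof(char *));
        memset(colinfo->colnames + old_num_cols, 0,
               (n - old_num_cols) * sizeof(char *));
        colinfo->num_cols = n;
    }

    /*
     * Populate the hash only from the names that existed before the
     * expansion; everything we just appended is NULL, so there is
     * nothing to gain from iterating over it.
     */
    if (colinfo->names_hash != NULL)
    {
        for (int i = 0; i < old_num_cols; i++)
        {
            if (colinfo->colnames[i] != NULL)
                names_hash_add(colinfo->names_hash, colinfo->colnames[i]);
        }
    }
}

int
main(void)
{
    colinfo_sketch colinfo = {0, NULL, NULL};

    /* fresh colinfo: nothing pre-existing, so nothing gets hashed */
    expand_colnames_array_and_hash(&colinfo, 1000);
    printf("allocated %d colname slots\n", colinfo.num_cols);
    return 0;
}
-----

The only real difference from calling the two functions back-to-back
is that the hash-building loop stops at the old array length, so the
NULL tail that the expansion just appended is never visited.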

David

Attachment Content-Type Size
pgdump_bench.sh.txt text/plain 561 bytes
pgdump_bench_results.txt text/plain 2.0 KB
