Re: hashjoins, index loops to retrieve pk/ux constrains in pg12

From: Arturas Mazeika <mazeika(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Lewis <mlewis(at)entrata(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: hashjoins, index loops to retrieve pk/ux constrains in pg12
Date: 2021-09-29 13:05:36
Message-ID: CAAUL=cFjKMy2dLZp2vAZ8=PHof93tPgKFThgk-OLSoqUt5U2Uw@mail.gmail.com
Lists: pgsql-performance

Hi Tom,

I agree that the query needs to be first correct, and second fast. I also
agree that this query works only if there are no duplicates among schemas
(if someone creates tables with the same table names, index names, and
constraint names in a different schema, this would not work). Provided these
assumptions hold (which they do on our customer systems), we use
intermediate liquibase scripts to keep track of our database (schema)
changes. Those intermediate scripts fire queries like the one mentioned
above, i.e., we cannot directly influence what the query looks like.

Given these very hard constraints (i.e., the query is formulated using
information_schema, and not against the catalogs directly), is it possible
to assess why the hash-join plan is chosen? At the end of the day, the I/O
block hit rate of this query with hash joins is 3-4 orders of magnitude
higher than with sort/index joins. Is there anything one can do on the
configuration side to avoid such hash-join pitfalls?

Cheers,
Arturas

On Tue, Sep 28, 2021 at 4:13 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Arturas Mazeika <mazeika(at)gmail(dot)com> writes:
> > Thanks a lot for having a look at the query once again in more detail. In
> > short, you are right, I fired the liquibase scripts and observed the
> > exact query that was hanging in pg_stat_activity. The query was:
>
> > SELECT
> > FK.TABLE_NAME as "TABLE_NAME"
> > , CU.COLUMN_NAME as "COLUMN_NAME"
> > , PK.TABLE_NAME as "REFERENCED_TABLE_NAME"
> > , PT.COLUMN_NAME as "REFERENCED_COLUMN_NAME"
> > , C.CONSTRAINT_NAME as "CONSTRAINT_NAME"
> > FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C
> > INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK ON
> > C.CONSTRAINT_NAME = FK.CONSTRAINT_NAME
> > INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK ON
> > C.UNIQUE_CONSTRAINT_NAME = PK.CONSTRAINT_NAME
> > INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU ON C.CONSTRAINT_NAME
> > = CU.CONSTRAINT_NAME
> > INNER JOIN (
> > SELECT
> > i1.TABLE_NAME
> > , i2.COLUMN_NAME
> > FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1
> > INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2 ON
> > i1.CONSTRAINT_NAME = i2.CONSTRAINT_NAME
> > WHERE i1.CONSTRAINT_TYPE = 'PRIMARY KEY'
> > ) PT ON PT.TABLE_NAME = PK.TABLE_NAME WHERE
> > lower(FK.TABLE_NAME)='secrole_condcollection'
>
> TBH, before worrying about performance you should be worrying about
> correctness. constraint_name alone is not a sufficient join key
> for these tables, so who's to say whether you're even getting the
> right answers?
>
> Per SQL spec, the join key to use is probably constraint_catalog
> plus constraint_schema plus constraint_name. You might say you
> don't need to compare constraint_catalog because that's fixed
> within any one Postgres database, and that observation would be
> correct. But you can't ignore the schema.
>
> What's worse, the SQL-spec join keys are based on the assumption that
> constraint names are unique within schemas, which is not enforced in
> Postgres. Maybe you're all right here, because you're only looking
> at primary key constraints, which are associated with indexes, which
> being relations do indeed have unique-within-schema names. But you
> still can't ignore the schema.
>
> On the whole I don't think you're buying anything by going through
> the SQL-spec information views, because this query is clearly pretty
> dependent on Postgres-specific assumptions even if it looks like it's
> portable. And you're definitely giving up a lot of performance, since
> those views have so many complications from trying to map the spec's
> view of whats-a-constraint onto the Postgres objects (not to mention
> the spec's arbitrary opinions about which objects you're allowed to
> see). This query would probably be simpler, more correct, and a
> lot faster if rewritten to query the Postgres catalogs directly.
>
> regards, tom lane
>
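
For reference, here is a rough sketch of what "query the Postgres catalogs
directly" could look like for this lookup. It is not the query liquibase
issues and is not a drop-in replacement: it pairs each foreign-key column
with the column it references, whereas the original query joins every
foreign-key column with every primary-key column of the referenced table.
It assumes a PostgreSQL version where unnest() accepts several arrays in
FROM (9.4 or later). Because the joins go through OIDs rather than names,
same-named constraints in different schemas cannot be mixed up.

```sql
-- Sketch only: foreign-key columns of one table and the columns they
-- reference, read from pg_constraint instead of information_schema.
SELECT
    fk_rel.relname AS "TABLE_NAME",
    fk_att.attname AS "COLUMN_NAME",
    pk_rel.relname AS "REFERENCED_TABLE_NAME",
    pk_att.attname AS "REFERENCED_COLUMN_NAME",
    con.conname    AS "CONSTRAINT_NAME"
FROM pg_catalog.pg_constraint con
-- conrelid is the table that owns the foreign key,
-- confrelid is the table it points to
JOIN pg_catalog.pg_class fk_rel ON fk_rel.oid = con.conrelid
JOIN pg_catalog.pg_class pk_rel ON pk_rel.oid = con.confrelid
-- conkey / confkey are parallel arrays of column numbers;
-- unnesting them together keeps the column pairs aligned
CROSS JOIN LATERAL unnest(con.conkey, con.confkey)
    AS cols(fk_attnum, pk_attnum)
JOIN pg_catalog.pg_attribute fk_att
    ON fk_att.attrelid = con.conrelid
   AND fk_att.attnum   = cols.fk_attnum
JOIN pg_catalog.pg_attribute pk_att
    ON pk_att.attrelid = con.confrelid
   AND pk_att.attnum   = cols.pk_attnum
WHERE con.contype = 'f'   -- foreign-key constraints only
  AND lower(fk_rel.relname) = 'secrole_condcollection';
```

If the table name is not unique across schemas, a further join from
fk_rel.relnamespace to pg_catalog.pg_namespace with a filter on nspname
would narrow the result to one schema.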
