Regression instability + performance issue in TRIGGERS view

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Regression instability + performance issue in TRIGGERS view
Date: 2020-04-22 22:27:50
Message-ID: 5891.1587594470@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I looked into the cause of several recent buildfarm failures:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=handfish&dt=2020-04-20%2020%3A32%3A23
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-04-18%2018%3A20%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=devario&dt=2020-02-27%2014%3A18%3A34
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ayu&dt=2020-01-17%2022%3A56%3A13

all of which look like this:

diff -U3 /home/filiperosset/dev/build-farm-11/HEAD/pgsql.build/src/test/regress/expected/triggers.out /home/filiperosset/dev/build-farm-11/HEAD/pgsql.build/src/test/regress/results/triggers.out
--- /home/filiperosset/dev/build-farm-11/HEAD/pgsql.build/src/test/regress/expected/triggers.out 2020-03-19 17:54:52.037720127 -0300
+++ /home/filiperosset/dev/build-farm-11/HEAD/pgsql.build/src/test/regress/results/triggers.out 2020-04-20 17:44:02.024230072 -0300
@@ -2559,22 +2559,7 @@
FROM information_schema.triggers
WHERE event_object_table IN ('parent', 'child1', 'child2', 'child3')
ORDER BY trigger_name COLLATE "C", 2;
- trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
---------------------+--------------------+---------------------+--------------------+--------------+------------------+--------------------+---------------+----------------------------+----------------------------
- child1_delete_trig | DELETE | public | child1 | 1 | | STATEMENT | AFTER | old_table |
- child1_insert_trig | INSERT | public | child1 | 1 | | STATEMENT | AFTER | | new_table
- child1_update_trig | UPDATE | public | child1 | 1 | | STATEMENT | AFTER | old_table | new_table
- child2_delete_trig | DELETE | public | child2 | 1 | | STATEMENT | AFTER | old_table |
- child2_insert_trig | INSERT | public | child2 | 1 | | STATEMENT | AFTER | | new_table
- child2_update_trig | UPDATE | public | child2 | 1 | | STATEMENT | AFTER | old_table | new_table
- child3_delete_trig | DELETE | public | child3 | 1 | | STATEMENT | AFTER | old_table |
- child3_insert_trig | INSERT | public | child3 | 1 | | STATEMENT | AFTER | | new_table
- child3_update_trig | UPDATE | public | child3 | 1 | | STATEMENT | AFTER | old_table | new_table
- parent_delete_trig | DELETE | public | parent | 1 | | STATEMENT | AFTER | old_table |
- parent_insert_trig | INSERT | public | parent | 1 | | STATEMENT | AFTER | | new_table
- parent_update_trig | UPDATE | public | parent | 1 | | STATEMENT | AFTER | old_table | new_table
-(12 rows)
-
+ERROR: cache lookup failed for function 22540
-- insert directly into children sees respective child-format tuples
insert into child1 values ('AAA', 42);
NOTICE: trigger = child1_insert_trig, new table = (AAA,42)

What appears to be happening is that the concurrent updatable_views
test creates/deletes some triggers and trigger functions, and with
bad timing luck that results in a cache lookup failure within
pg_get_triggerdef(). This isn't a new problem (note that crake's
failure is on v12 not HEAD) but it seems that it's gotten much more
probable of late, probably because of unrelated test changes
altering the timing of these tests.

It's fairly annoying that the triggers view could be prone to failing due
to concurrent DDL, but I don't see any near-term fix for that general
problem --- as long as the ruleutils infrastructure relies on syscache
lookups in any way, we're going to have some problems of that sort.
But what's *really* annoying is that the view can fail due to DDL
changes that aren't even on any of the tables being looked at. That's
because the planner is failing to push down the WHERE condition, meaning
that this isn't only a regression instability issue but a rather severe
performance problem: there is no way for a query on the triggers view
to not evaluate the entire view.

The reason for the pushdown failure is that the triggers view uses
a window function to compute its action_order output:

rank() OVER (PARTITION BY n.oid, c.oid, em.num, t.tgtype & 1, t.tgtype & 66 ORDER BY t.tgname)

and the planner can't push the restriction on relname below that for fear
of changing the window function's results. It does know that restrictions
on the PARTITION BY columns can be pushed down, but the query restriction
is not on any of those columns.

However, it's not like the query restriction is unrelated to that
partition condition, it's just not quite the same thing.
I experimented with

- rank() OVER (PARTITION BY n.oid, c.oid, em.num, t.tgtype & 1, t.tgtype & 66 ORDER BY t.tgname)
+ rank() OVER (PARTITION BY n.nspname::sql_identifier, c.relname::sql_identifier, em.num, t.tgtype & 1, t.tgtype & 66 ORDER BY t.tgname)

so as to make the first two partitioning columns actually match the
relevant view output columns, and lo, the planner now successfully
pushes the table-name constraint down to the pg_class relation scan.
(With a little more work, we could make the other partitioning columns
also match view output columns exactly, but I'm not sure that any of
those are worth messing with. Restrictions on the table and schema
names seem like the only things likely to be interesting for
performance.)

This should not only fix the regression instability but offer a pretty
significant speedup for real-world use of the triggers view, so I
propose squeezing it in even though we're past feature freeze.

However ... what about the back branches? As noted, we are seeing
this now in the back branches, at least v12. Are we willing to make
a change in the information_schema in back branches to make the
instability go away? (And if not, how else could we fix it?)

I note that this is actually a performance regression in the triggers
view: before v11, when this rank() call was added, the planner had
no problem pushing down restrictions on the table name.

I think that a reasonable case could be made for back-patching this
change as far as v11, though of course without a catversion bump.

Thoughts?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-04-22 22:40:07 Re: More efficient RI checks - take 2
Previous Message Robert Haas 2020-04-22 21:43:40 Re: More efficient RI checks - take 2