How to hint 2 coulms IS NOT DISTINCT FROM each other

From: Kim Rose Carlsen <krc(at)hiper(dot)dk>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: How to hint 2 coulms IS NOT DISTINCT FROM each other
Date: 2016-10-28 16:29:26
Message-ID: AM4PR0501MB26103A125CD2C12DFE48CECBC7AD0@AM4PR0501MB2610.eurprd05.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi

I was wondering if there is a way to hint that two columns in two different tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if table_a.key = 'test' THEN table_b.key = 'test' .

The equals operator already does this but it does not handle NULLS very well (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and doesn't establish the same inference rules as equals.

Example:

CREATE TABLE a (
id INTEGER PRIMARY KEY,
key VARCHAR,
value VARCHAR

);

CREATE INDEX ON a (key);

INSERT INTO a
VALUES (1, 'test', 'test'), (2, 'foo', 'bar'), (3, null, null), (4, 'baz', 'qoz');

CREATE VIEW view_a AS (
SELECT table_a.id,
table_a.key,
table_a.value,
table_b.key as b_key
FROM a AS table_a
JOIN a AS table_b
ON table_a.id = table_b.id
);

CREATE VIEW view_a_eq AS (
SELECT table_a.id,
table_a.key,
table_a.value,
table_b.key as b_key
FROM a AS table_a
JOIN a AS table_b
ON table_a.id = table_b.id
AND table_a.key = table_b.key
);

CREATE VIEW view_a_distinct AS (
SELECT table_a.id,
table_a.key,
table_a.value,
table_b.key as b_key
FROM a AS table_a
JOIN a AS table_b
ON table_a.id = table_b.id
AND table_a.key IS NOT DISTINCT FROM table_b.key
);

EXPLAIN SELECT * FROM view_a WHERE key = 'test';

QUERY PLAN
------------------------------------------------------------------------------------
Hash Join (cost=12.69..34.42 rows=4 width=100)
Hash Cond: (table_b.id = table_a.id)
-> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36)
-> Hash (cost=12.64..12.64 rows=4 width=68)
-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)

We only get index scan on table_a

EXPLAIN SELECT * FROM view_a_eq WHERE key = 'test';

QUERY PLAN
------------------------------------------------------------------------------------
Nested Loop (cost=8.36..25.53 rows=1 width=100)
Join Filter: (table_a.id = table_b.id)
-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
-> Materialize (cost=4.18..12.66 rows=4 width=36)
-> Bitmap Heap Scan on a table_b (cost=4.18..12.64 rows=4 width=36)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
We get index scan on both tables and the where clause is pushed all the way down

EXPLAIN SELECT * FROM view_a_distinct WHERE key = 'test';

QUERY PLAN
------------------------------------------------------------------------------------
Hash Join (cost=12.69..34.43 rows=1 width=100)
Hash Cond: (table_b.id = table_a.id)
Join Filter: (NOT ((table_a.key)::text IS DISTINCT FROM (table_b.key)::text))
-> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36)
-> Hash (cost=12.64..12.64 rows=4 width=68)
-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
Same as the first example

In these examples it really doesn't matter which plan is used. But for larger view it might be a great hint for the optimizer to know that since we are joining on a primary key we could hint that all other columns for the table is actually NOT DISTINCT from each other. This will result in the planner being able to push the WHERE condition down into the other joined tables.

It works well for the = operator, but it is easy to see the it fails as soon as we asks about nulls.

SELECT * FROM view_a_eq WHERE key IS NULL;
id | key | value | b_key
----+-----+-------+-------
(0 rows)

Maybe my examples are too simple, but I hope you can verify the same will hold for a much larger table with 2 different views on top of them.

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Kim Rose Carlsen 2016-10-28 16:35:00 Re: How to hint two columns IS NOT DISTINCT FROM each other in a join
Previous Message Alban Hertroys 2016-10-28 15:47:55 Re: WHERE ... IN condition and multiple columns in subquery