Re: Equivalence Classes when using IN

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Kim Rose Carlsen <krc(at)hiper(dot)dk>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Equivalence Classes when using IN
Date: 2017-10-11 22:12:01
Message-ID: CAKJS1f8dVWnpt4Hws65A7uSs4xfrg4aMLJfqLRCtqD+mt3whEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 12 October 2017 at 10:15, Kim Rose Carlsen <krc(at)hiper(dot)dk> wrote:
> Why don't I see that predicate (customer_id) pushed into the outer nested loop so we don't have to sort the whole table on each loop.
>
> (See original post and follow up for definitions)
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------
> Nested Loop Left Join (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1)
> Join Filter: (c.customer_id = product.customer_id)
> Rows Removed by Join Filter: 199900
> -> Nested Loop (cost=0.28..199.21 rows=334 width=12) (actual time=0.075..1.146 rows=100 loops=1)
> -> Seq Scan on customer (cost=0.00..21.51 rows=334 width=8) (actual time=0.067..0.282 rows=100 loops=1)
> Filter: (age < 20)
> Rows Removed by Filter: 901
> -> Index Only Scan using customer_pkey on customer c (cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=100)
> Index Cond: (customer_id = customer.customer_id)
> Heap Fetches: 100
> -> Materialize (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100)
> -> Sort (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1)
> Sort Key: product.customer_id, product.product_id
> Sort Method: quicksort Memory: 142kB
> -> Seq Scan on product (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001 loops=1)
> Planning time: 0.214 ms
> Execution time: 35.284 ms

I don't really see any blockers that would mean we couldn't support
this, it's just that we don't currently support it. The predicates
that we do pushdown are just ones we deem as safe to pushdown of the
ones that appear in the query, or ones that can be derived through
equivalence. (e.g. ab.a = ab.b and ab.b = 1 --> ab.a = 1)

For example, consider the difference between the following:

create table ab(a int, b int);
insert into ab select x,x from generate_series(1,1000000)x;
create index on ab(a);
create index on ab(b);

postgres=# explain select * from (select distinct on (a) a,b from ab
order by a,b) ab where ab.b < 10;
QUERY PLAN
-----------------------------------------------------------------------------------
Subquery Scan on ab (cost=127757.34..145257.34 rows=333333 width=8)
Filter: (ab.b < 10)
-> Unique (cost=127757.34..132757.34 rows=1000000 width=8)
-> Sort (cost=127757.34..130257.34 rows=1000000 width=8)
Sort Key: ab_1.a, ab_1.b
-> Seq Scan on ab ab_1 (cost=0.00..14425.00
rows=1000000 width=8)
(6 rows)

postgres=# explain select * from (select distinct on (a) a,b from ab
order by a,b) ab where ab.a < 10;
QUERY PLAN
-------------------------------------------------------------------------------
Unique (cost=8.73..8.77 rows=9 width=8)
-> Sort (cost=8.73..8.75 rows=9 width=8)
Sort Key: ab.a, ab.b
-> Index Scan using ab_a_idx on ab (cost=0.42..8.58 rows=9 width=8)
Index Cond: (a < 10)
(5 rows)

The "a < 10" was pushed down as we're distinct on (a), but pushing
down "ab.b < 10" would be invalid and could cause wrong results.

The predicate you'd like to see pushed down is actually a parameter in
a parameterized Path and we don't currently generate any parameterized
paths outside of each query level. Likely there's no good reason for
this other than it's not been done yet, but it's really only been
since 9.6 that the query planner has been flexible enough to possibly
allow something like this to be done at all.

The reason the planner may appear to push down the predicate when
there's no DISTINCT ON clause is that the planner was able to pull the
subquery (or view) up a level. When the planner is able to do this
it's much more flexible to the types of plans it can generate. It's
just that we don't ever pull up subqueries with DISTINCT ON, plus a
bunch of other reasons.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Craig Ringer 2017-10-12 02:25:43 Re: BDR, wal sender, high system cpu, mutex_lock_common
Previous Message Kim Rose Carlsen 2017-10-11 21:15:33 Re: Equivalence Classes when using IN