From: | Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com> |
---|---|
To: | Denis de Bernardy <ddebernardy(at)yahoo(dot)com> |
Cc: | Fredrik Widlert <fredrik(dot)widlert(at)digpro(dot)se>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org> |
Subject: | Re: How to avoid seq scans for joins between union-all views (test case included) |
Date: | 2011-05-13 14:48:38 |
Message-ID: | BANLkTimcBwULhY0uOuxgcd1jVTbqp=v9jQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
2011/5/13 Denis de Bernardy <ddebernardy(at)yahoo(dot)com>:
> I might have misread, but:
>
>> select * from connections where locked_by <> 4711
>> union all
>> select * from connections_locked where locked_by = 4711;
>
>
> The first part will result in a seq scan irrespective of indexes, and the second has no index on locked_by. The best you can do is to eliminate the seq scan on the second by adding the missing index on locked_by.
just rework the primary key to set the locked_id first should work.
>
> That said, note that index usage depends on your data distribution: postgres may identify that it'll read most/all of the table anyway, and opt to do a (cheaper) seq scan instead.
Fredrick, What indexes Oracle did choose ? (index-only scan ?)
>
> D
>
>
> ----- Original Message -----
>> From: Fredrik Widlert <fredrik(dot)widlert(at)digpro(dot)se>
>> To: pgsql-performance(at)postgresql(dot)org
>> Cc:
>> Sent: Friday, May 13, 2011 1:55 PM
>> Subject: [PERFORM] How to avoid seq scans for joins between union-all views (test case included)
>>
>> Hi everyone
>>
>> We have recently started to port an application from Oracle to PostgreSQL.
>> So far, we are amazed with how great most things work.
>>
>> However, we have run into performance problems in one type of query which
>> is quite common in our application. We have created a (simplified)
>> reproducible test case which (hopefully!) creates all necessary tables
>> and data to
>> show the problem.
>>
>> Plain-text description of the data model in the test case:
>>
>> We have a set of objects (like electrical cables), each having
>> two nodes in the table "connections" (think of these two rows together
>> as an edge in a graph).
>>
>> Another table "connections_locked" contains rows for some of
>> the same objects, which are locked by a long transaction.
>>
>> The view connections_v performs a union all of the rows from
>> "connections" which are not modified in the current long
>> transaction with the rows from "connections_locked" which
>> are modified in the current long transaction.
>>
>> Goal:
>> Given an object id, we want to find all neighbors for this
>> object (that is, objects which share a node with this object).
>>
>> Problem:
>> We think that our query used to find neighbors would benefit
>> greatly from using some of our indexes, but we fail to make it
>> do so.
>>
>>
>> Over to the actual test case:
>>
>> ----------------------------------------------
>>
>> -- Tested on (from select version ()):
>> -- PostgreSQL 9.0.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC)
>> 4.1.2 20080704 (Red Hat 4.1.2-46), 32-bit
>> -- PostgreSQL 9.1beta1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC)
>> 4.1.2 20080704 (Red Hat 4.1.2-46), 32-bit
>>
>> -- Ubuntu 11.04, uname -a output:
>> -- Linux <hostname> 2.6.38-8-generic-pae #42-Ubuntu SMP Mon Apr 11
>> 05:17:09 UTC 2011 i686 i686 i386 GNU/Linux
>> -- Processor: Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz
>> -- Drive: Intel X25-M SSD
>>
>>
>> drop table if exists connections cascade;
>> drop table if exists connections_locked cascade;
>>
>>
>> create table connections (
>> con_id serial primary key,
>> locked_by integer not null,
>> obj_id integer not null,
>> node integer not null
>> );
>>
>>
>> -- create test nodes, two per obj_id
>> insert into connections (locked_by, obj_id, node)
>> select 0, n/2, 1000 + (n + 1)/2 from generate_series (1,500000) as n;
>>
>> create index connections_node_idx on connections (node);
>> create index connections_obj_idx on connections (obj_id);
>> vacuum analyze connections;
>>
>>
>>
>> create table connections_locked (
>> con_id integer not null,
>> locked_by integer not null,
>> obj_id integer not null,
>> node integer not null,
>> constraint locked_pk primary key (con_id, locked_by)
>> );
>>
>> -- mark a few of the objects as locked by a long transaction
>> insert into connections_locked (con_id, locked_by, obj_id, node)
>> select n, 1 + n/50, n/2, 1000 + (n + 1)/2 from generate_series (1,25000) as n;
>>
>> create index connections_locked_node_idx on connections_locked (node);
>> create index connections_locked_obj_idx on connections_locked (obj_id);
>> vacuum analyze connections_locked;
>>
>>
>> -- Create a view showing the world as seen by long transaction 4711.
>> -- In real life, this uses a session variable instead of a hard-coded value.
>> create or replace view connections_v as
>> select * from connections where locked_by <> 4711
>> union all
>> select * from connections_locked where locked_by = 4711;
>>
>>
>> -- This is the query we are trying to optimize.
>> -- We expect this to be able to use our indexes, but instead get
>> sequential scans
>> explain analyze
>> select
>> con2.obj_id
>> from
>> connections_v con1,
>> connections_v con2
>> where
>> con1.obj_id = 17 and
>> con2.node = con1.node
>> ;
>>
>>
>> -- Output:
>> -- Hash Join (cost=16.69..16368.89 rows=7501 width=4) (actual
>> time=0.096..778.830 rows=4 loops=1)
>> -- Hash Cond: ("*SELECT* 1".node = "*SELECT* 1".node)
>> -- -> Append (cost=0.00..14402.00 rows=500050 width=8) (actual
>> time=0.011..640.163 rows=500000 loops=1)
>> -- -> Subquery Scan on "*SELECT* 1" (cost=0.00..13953.00
>> rows=500000 width=8) (actual time=0.011..430.645 rows=500000 loops=1)
>> -- -> Seq Scan on connections (cost=0.00..8953.00
>> rows=500000 width=16) (actual time=0.009..178.535 rows=500000 loops=1)
>> -- Filter: (locked_by <> 4711)
>> -- -> Subquery Scan on "*SELECT* 2" (cost=0.00..449.00
>> rows=50 width=8) (actual time=3.254..3.254 rows=0 loops=1)
>> -- -> Seq Scan on connections_locked
>> (cost=0.00..448.50 rows=50 width=16) (actual time=3.253..3.253 rows=0
>> loops=1)
>> -- Filter: (locked_by = 4711)
>> -- -> Hash (cost=16.66..16.66 rows=3 width=4) (actual
>> time=0.028..0.028 rows=2 loops=1)
>> -- Buckets: 1024 Batches: 1 Memory Usage: 1kB
>> -- -> Append (cost=0.00..16.66 rows=3 width=4) (actual
>> time=0.013..0.025 rows=2 loops=1)
>> -- -> Subquery Scan on "*SELECT* 1"
>> (cost=0.00..8.35
>> rows=2 width=4) (actual time=0.013..0.016 rows=2 loops=1)
>> -- -> Index Scan using connections_obj_idx on
>> connections (cost=0.00..8.33 rows=2 width=16) (actual
>> time=0.012..0.014 rows=2 loops=1)
>> -- Index Cond: (obj_id = 17)
>> -- Filter: (locked_by <> 4711)
>> -- -> Subquery Scan on "*SELECT* 2"
>> (cost=0.00..8.30
>> rows=1 width=4) (actual time=0.008..0.008 rows=0 loops=1)
>> -- -> Index Scan using connections_locked_obj_idx
>> on connections_locked (cost=0.00..8.29 rows=1 width=16) (actual
>> time=0.007..0.007 rows=0 loops=1)
>> -- Index Cond: (obj_id = 17)
>> -- Filter: (locked_by = 4711)
>>
>>
>>
>> -- Rewriting the query to an almost-equivalent form yields almost the
>> same result (that is, seq scans)
>> explain analyze
>> select
>> con2.obj_id
>> from
>> connections_v con2
>> where con2.node in (select node from connections_v con1 where
>> con1.obj_id = 17);
>>
>>
>> -- Simplifying the query even more to use a sub-select with a
>> hard-coded value still results in seq scans
>> explain analyze
>> select
>> con2.obj_id
>> from
>> connections_v con2
>> where con2.node in (select 1015);
>>
>>
>> -- Finally, when we simplify even more and just use a constant, we get
>> the index accesses we were hoping
>> -- for all along.
>> explain analyze
>> select
>> con2.obj_id
>> from
>> connections_v con2
>> where con2.node in (1015);
>>
>> -- Result (cost=0.00..16.66 rows=3 width=4) (actual time=0.048..0.079
>> rows=2 loops=1)
>> -- -> Append (cost=0.00..16.66 rows=3 width=4) (actual
>> time=0.047..0.076 rows=2 loops=1)
>> -- -> Subquery Scan on "*SELECT* 1" (cost=0.00..8.35
>> rows=2
>> width=4) (actual time=0.046..0.049 rows=2 loops=1)
>> -- -> Index Scan using connections_node_idx on
>> connections (cost=0.00..8.33 rows=2 width=16) (actual
>> time=0.046..0.048 rows=2 loops=1)
>> -- Index Cond: (node = 1015)
>> -- Filter: (locked_by <> 4711)
>> -- -> Subquery Scan on "*SELECT* 2" (cost=0.00..8.30
>> rows=1
>> width=4) (actual time=0.025..0.025 rows=0 loops=1)
>> -- -> Index Scan using connections_locked_node_idx on
>> connections_locked (cost=0.00..8.29 rows=1 width=16) (actual
>> time=0.024..0.024 rows=0 loops=1)
>> -- Index Cond: (node = 1015)
>> -- Filter: (locked_by = 4711)
>>
>>
>>
>> ------- end of test case -----
>>
>> Can someone explain what is happening here? Is there some way we can
>> rewrite our query or some setting we could turn on or off to get the
>> optimizer to choose to use our indexes?
>>
>> (testing with "set enable_seqscan = false;" does not make a difference
>> as far as we can see)
>>
>> To verify that we have really created all necessary indexes, we have
>> converted this simplified test case to Oracle syntax and tested it on
>> our Oracle server. In this case, we do get the expected index accesses,
>> so we think that we have in fact managed to isolate the problem using
>> this test case.
>>
>> What we are hoping for:
>> Since we have lots of queries joining these kind of "union all"-views
>> between a master table and a transaction table, we would be really
>> glad to hear something like "when you use these kinds of views, you
>> need to do X, Y and Z to get good performance".
>>
>> Thanks in advance for any help!
>> /Fredrik
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
From | Date | Subject | |
---|---|---|---|
Next Message | Fredrik Widlert | 2011-05-13 15:09:42 | Re: How to avoid seq scans for joins between union-all views (test case included) |
Previous Message | Denis de Bernardy | 2011-05-13 14:23:19 | Re: How to avoid seq scans for joins between union-all views (test case included) |