Re: How to optimize WHERE column_a IS NOT NULL OR column_b = 'value'

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Andreas Joseph Krogh <andreas(at)visena(dot)com>
Cc: postgres list <pgsql-sql(at)postgresql(dot)org>
Subject: Re: How to optimize WHERE column_a IS NOT NULL OR column_b = 'value'
Date: 2014-08-11 20:29:17
Message-ID: CAFj8pRD3TFoKssHKQQ9sH_S5U+_X95hissF-CbeZq0hYBP5QMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

2014-08-11 22:19 GMT+02:00 Andreas Joseph Krogh <andreas(at)visena(dot)com>:

> På mandag 11. august 2014 kl. 21:34:57, skrev Pavel Stehule <
> pavel(dot)stehule(at)gmail(dot)com>:
>
> Hi
>
> 2014-08-11 18:01 GMT+02:00 Andreas Joseph Krogh <andreas(at)visena(dot)com>:
>>
>> Hi folks,
>>
>> I have the following schema (simplified for this example).
>>
>> create table folder(id integer primary key, name varchar not null);
>>
>> create table document(id serial primary key, name varchar not null,
>> owner_id integer not null, folder_id integer references folder(id));
>> create index document_owner_idx ON document(owner_id);
>> create index document_folder_idx ON document(folder_id);
>>
>> insert into folder(id, name) values(1, 'Folder A');
>> insert into folder(id, name) values(2, 'Folder B');
>> insert into document(name, owner_id, folder_id) values('Document A', 1,
>> 1);
>> insert into document(name, owner_id, folder_id) values('Document B', 1,
>> NULL);
>> insert into document(name, owner_id, folder_id) values('Document C', 2,
>> 2);
>> insert into document(name, owner_id, folder_id) values('Document D', 2,
>> NULL);
>>
>> select f.id, f.name, doc.id, doc.owner_id, doc.name
>> FROM document doc left outer join folder f ON doc.folder_id = f.id
>> WHERE doc.folder_id is not null OR doc.owner_id = 1;
>>
>> QUERY PLAN
>>
>> -----------------------------------------------------------------------------------------------------------------------------
>> Nested Loop Left Join (cost=0.15..13.77 rows=4 width=76) (actual
>> time=0.031..0.045 rows=3 loops=1)
>> -> Seq Scan on document doc (cost=0.00..1.05 rows=4 width=44)
>> (actual time=0.012..0.018 rows=3 loops=1)
>> Filter: ((folder_id IS NOT NULL) OR (owner_id = 1))
>> Rows Removed by Filter: 1
>> -> Index Scan using folder_pkey on folder f (cost=0.15..3.17 rows=1
>> width=36) (actual time=0.005..0.006 rows=1 loops=3)
>> Index Cond: (doc.folder_id = id)
>> Planning time: 0.267 ms
>> Execution time: 0.094 ms
>> (8 rows)
>>
>>
>> Is the a way to write a query which uses an index efficiently for such a
>> schema?
>>
>> I'd like to eliminate the Filter: ((folder_id IS NOT NULL) OR (owner_id =
>> 1)) and rather have "index cond" insted, is that possible?
>>
>
> your example is partially broken - ANALYZE and hashjoin and seqscan
> penalization are missing - index scan is not used due too small table sizes
>
> I tested 9.5, probably same as 9.4 and there indexes are used
>
> postgres=# set enable_hashjoin to off;
> SET
> Time: 0.473 ms
> postgres=# set enable_seqscan to off;
> SET
> Time: 0.904 ms
> postgres=# explain select f.id, f.name, doc.id, doc.owner_id, doc.name
> FROM document doc left outer join folder f ON doc.folder_id = f.id
> WHERE doc.folder_id is not null OR doc.owner_id = 1;
> QUERY
> PLAN
>
> ────────────────────────────────────────────────────────────────────────────────────────────────
> Merge Left Join (cost=0.26..24.38 rows=3 width=32)
> Merge Cond: (doc.folder_id = f.id)
> -> Index Scan using document_folder_idx on document doc
> (cost=0.13..12.20 rows=3 width=23)
> Filter: ((folder_id IS NOT NULL) OR (owner_id = 1))
> -> Index Scan using folder_pkey on folder f (cost=0.13..12.16 rows=2
> width=13)
> Planning time: 0.663 ms
> (6 rows)
>
> default 9.2, 9.3, ...
>
> postgres=# explain select f.id, f.name, doc.id, doc.owner_id, doc.name
> FROM document doc left outer join folder f ON doc.folder_id = f.id
> WHERE doc.folder_id is not null OR doc.owner_id = 1;
> QUERY PLAN
> ─────────────────────────────────────────────────────────────────────
> Hash Left Join (cost=1.04..2.12 rows=3 width=32)
> Hash Cond: (doc.folder_id = f.id)
> -> Seq Scan on document doc (cost=0.00..1.05 rows=3 width=23)
> Filter: ((folder_id IS NOT NULL) OR (owner_id = 1))
> -> Hash (cost=1.02..1.02 rows=2 width=13)
> -> Seq Scan on folder f (cost=0.00..1.02 rows=2 width=13)
> (6 rows)
>
> and 9.2 after hashjoin and indexscan penalization
>
> postgres=# explain select f.id, f.name, doc.id, doc.owner_id, doc.name
> FROM document doc left outer join folder f ON doc.folder_id = f.id
> WHERE doc.folder_id is not null OR doc.owner_id = 1;
> QUERY
> PLAN
>
> ────────────────────────────────────────────────────────────────────────────────────────────────
> Merge Left Join (cost=0.00..24.62 rows=3 width=32)
> Merge Cond: (doc.folder_id = f.id)
> -> Index Scan using document_folder_idx on document doc
> (cost=0.00..12.32 rows=3 width=23)
> Filter: ((folder_id IS NOT NULL) OR (owner_id = 1))
> -> Index Scan using folder_pkey on folder f (cost=0.00..12.28 rows=2
> width=13)
> (5 rows)
>
> Time: 2.258 ms
>
> What is your PostgreSQL?
>
> Regards
>
> Pavel
>
> P.S. ten years ago I had a similar issue - "OR" predikates can be replaced
> by UNION
>
> you can try:
>
> SELECT * FROM
> (SELECT * FROM doc
> WHERE folder_id IS NOT NULL
> UNION
> SELECT * FROM doc
> WHERE owner_id = 1) s
> LEFT JOIN folder ON s.folder_id = folder.id
>
> or some similar magic
>
> select f.id, f.name, doc.id, doc.owner_id, doc.name
> FROM document doc left outer join folder f ON doc.folder_id = f.id
> WHERE doc.folder_id is not null OR doc.owner_id = 1;
>
>
> I see turning enable_seqscan to off results in BitmapOr in 9.3:
>
> loff=# select version();
>
> version
>
> -------------------------------------------------------------------------------------------------------------
> PostgreSQL 9.3.2 on x86_64-unknown-linux-gnu, compiled by gcc
> (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1, 64-bit
> (1 row)
> loff=# show enable_seqscan ;
> enable_seqscan
> ----------------
> off
> (1 row)
> loff=# explain analyze select f.id, f.name, doc.id, doc.owner_id,
> doc.name
>
> FROM document doc left outer join folder f ON doc.folder_id = f.id
> WHERE doc.folder_id is not null OR doc.owner_id = 1;
> QUERY PLAN
>
> --------------------------------------------------------------------------------------------------------------------------------------------
> Hash Left Join (cost=103.08..141.89 rows=1095 width=76) (actual
> time=0.052..0.057 rows=3 loops=1)
>
> Hash Cond: (doc.folder_id = f.id)
> -> Bitmap Heap Scan on document doc (cost=21.10..44.85 rows=1095
> width=44) (actual time=0.027..0.028 rows=3 loops=1)
> Recheck Cond: ((folder_id IS NOT NULL) OR (owner_id = 1))
> -> BitmapOr (cost=21.10..21.10 rows=1100 width=0) (actual
> time=0.018..0.018 rows=0 loops=1)
> -> Bitmap Index Scan on document_folder_idx
> (cost=0.00..16.36 rows=1094 width=0) (actual time=0.012..0.012 rows=2
> loops=1)
> Index Cond: (folder_id IS NOT NULL)
> -> Bitmap Index Scan on document_owner_idx
> (cost=0.00..4.20 rows=6 width=0) (actual time=0.005..0.005 rows=2 loops=1)
> Index Cond: (owner_id = 1)
> -> Hash (cost=66.60..66.60 rows=1230 width=36) (actual
> time=0.011..0.011 rows=2 loops=1)
> Buckets: 1024 Batches: 1 Memory Usage: 1kB
> -> Index Scan using folder_pkey on folder f (cost=0.15..66.60
> rows=1230 width=36) (actual time=0.005..0.006 rows=2 loops=1)
> Total runtime: 0.125 ms
> (13 rows)
>

there is very bad estimation, so some in data is strange

-> Bitmap Index Scan on document_folder_idx
(cost=0.00..16.36 rows=1094 width=0) (actual time=0.012..0.012 rows=2
loops=1)
Index Cond: (folder_id IS NOT NULL)

ensure fresh statistics

>
>
>
> In 9.4-beta2 it results in an index-scan with a filter:
>
> andreak=# select version();
> version
>
> ---------------------------------------------------------------------------------------------------------
> PostgreSQL 9.4beta2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu
> 4.8.2-19ubuntu1) 4.8.2, 64-bit
> (1 row)
> andreak=# show enable_seqscan ;
> enable_seqscan
> ----------------
> off
> (1 row)
> andreak=# explain analyze select f.id, f.name, doc.id, doc.owner_id,
> doc.name
>
> FROM document doc left outer join folder f ON doc.folder_id = f.id
> WHERE doc.folder_id is not null OR doc.owner_id = 1;
> QUERY PLAN
>
> -----------------------------------------------------------------------------------------------------------------------------------------
> Nested Loop Left Join (cost=0.28..18.92 rows=4 width=76) (actual
> time=0.032..0.044 rows=3 loops=1)
> -> Index Scan using document_folder_idx on document doc
> (cost=0.13..6.20 rows=4 width=44) (actual time=0.018..0.024 rows=3 loops=1)
>
> Filter: ((folder_id IS NOT NULL) OR (owner_id = 1))
> Rows Removed by Filter: 1
> -> Index Scan using folder_pkey on folder f (cost=0.15..3.17 rows=1
> width=36) (actual time=0.003..0.004 rows=1 loops=3)
>
> Index Cond: (doc.folder_id = id)
> Planning time: 0.260 ms
>
> Execution time: 0.094 ms
> (8 rows)
>
>
> I have quite large dataset (and some additional joins) in my prod-data and
> hoped that I could solve this with one index-scan on an index on
> document-table to avoid 2 index-scans and OR-ing the results.
>
> Thanks for help!
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andreas(at)visena(dot)com
> www.visena.com
> <https://www.visena.com>
>
>

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Shane Carey 2014-08-15 12:54:37 Stack builder
Previous Message Andreas Joseph Krogh 2014-08-11 20:19:00 Re: How to optimize WHERE column_a IS NOT NULL OR column_b = 'value'