Using an index for IS DISTINCT FROM queries

From: Steven Schlansker <steven(at)likeness(dot)com>
To: "pgsql-general(at)postgresql(dot)org General" <pgsql-general(at)postgresql(dot)org>
Subject: Using an index for IS DISTINCT FROM queries
Date: 2013-04-22 17:37:26
Message-ID: AEBD8B4A-D4DF-45C2-8B2D-7BBDA1C5B69F@likeness.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi everyone,
I have a large table (~150M rows) that keeps a version field. At any given time, it is expected that the vast majority of the rows are on the "current" version, but some may be missing.

To figure out all the missing our outdated values, I run a query along the lines of

SELECT id FROM source_table LEFT OUTER JOIN dest_table WHERE version IS DISTINCT FROM <current-version>

However, this query always selects a sequential scan and hash of both tables, which is *very* slow.

The statistics reflect that the planner knows that current-version is overwhelmingly common:

schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+-----------+------------------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
public | event | unpacker_version | f | 0 | 4 | 1 | {1} | {1} | | 1

but it doesn't help:

event=> explain select count(1) from event where unpacker_version is distinct from 1;
QUERY PLAN
------------------------------------------------------------------
Aggregate (cost=10658841.93..10658841.94 rows=1 width=0)
-> Seq Scan on event (cost=0.00..10658841.93 rows=1 width=0)
Filter: (unpacker_version IS DISTINCT FROM 1)
(3 rows)

I can "by hand" force the planner to consider the obvious solution (find values below, above, and null) but it is ugly:

event=> explain select count(1) from event where unpacker_version < 1 or unpacker_version > 1 or unpacker_version is null;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Aggregate (cost=139.14..139.15 rows=1 width=0)
-> Bitmap Heap Scan on event (cost=135.13..139.14 rows=1 width=0)
Recheck Cond: ((unpacker_version < 1) OR (unpacker_version > 1) OR (unpacker_version IS NULL))
-> BitmapOr (cost=135.13..135.13 rows=1 width=0)
-> Bitmap Index Scan on event_unpacker_version_idx (cost=0.00..45.04 rows=1 width=0)
Index Cond: (unpacker_version < 1)
-> Bitmap Index Scan on event_unpacker_version_idx (cost=0.00..45.04 rows=1 width=0)
Index Cond: (unpacker_version > 1)
-> Bitmap Index Scan on event_unpacker_version_idx (cost=0.00..45.04 rows=1 width=0)
Index Cond: (unpacker_version IS NULL)
(10 rows)

but this sucks to do such tuning by munging the query. Is there some case where these are not equivalent? If they are equivalent, would this be a reasonable case for the planner to consider and optimize by itself?

Thanks,
Steven

Browse pgsql-general by date

  From Date Subject
Next Message Christophe Pettus 2013-04-22 17:41:13 Re: Need solution for weekly database "snapshot"
Previous Message hamann.w 2013-04-22 17:26:34 how can this get faster