More correlated (?) index woes

From: Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: More correlated (?) index woes
Date: 2016-03-28 19:23:29
Message-ID: CAEzk6fex6MRdMfGiiaJyA1U=Q+f0qF=G4iESXvHtaHLb0dfRpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

So I accept that when using MIN(sc_id) against scdate it makes statistical
sense to use the sc_id index for a reasonable percentage of the full range
of scdate, unless we know in advance that scdate is closely correlated to
sc_id (because using MIN means we can stop immediately we hit a value).

However I'm now finding a similar problem when using a multi-table DELETE,
where the same obviously can't apply.

This query:

DELETE FROM pa USING legs WHERE legs.scdate BETWEEN 20160220 AND 20160222
AND legs.sc_id=pa.sc_id;

does what one would hope:

Delete on pa (cost=0.99..705406.24 rows=36570 width=12)
-> Nested Loop (cost=0.99..705406.24 rows=36570 width=12)
-> Index Scan using legs_scdate_idx on legs (cost=0.43..1171.88
rows=14458 width=10)
Index Cond: ((scdate >= 20160220) AND (scdate <= 20160222))
-> Index Scan using pa_pkey on pa (cost=0.56..48.33 rows=38
width=10)
Index Cond: (sc_id = legs.sc_id)

However as soon as I add an extra test for field1 IS NULL, it apparently
goes insane:

Delete on pa (cost=577260.90..626296.46 rows=23732 width=12) (actual
time=41870.770..41870.770 rows=0 loops=1)
-> Hash Join (cost=577260.90..626296.46 rows=23732 width=12) (actual
time=37886.396..41315.668 rows=44960 loops=1)
Hash Cond: (legs.sc_id = pa.sc_id)
-> Index Scan using legs_scdate_idx on legs (cost=0.43..1171.88
rows=14458 width=10) (actual time=0.030..13.667 rows=21281 loops=1)
Index Cond: ((scdate >= 20160220) AND (scdate <= 20160222))
-> Hash (cost=481691.12..481691.12 rows=5497868 width=10)
(actual time=37805.756..37805.756 rows=4875870 loops=1)
Buckets: 131072 Batches: 64 Memory Usage: 4311kB
-> Seq Scan on pa (cost=0.00..481691.12 rows=5497868
width=10) (actual time=0.008..35869.304 rows=4875870 loops=1)
Filter: (field1 IS NULL)
Rows Removed by Filter: 2688634
Planning time: 0.447 ms
Execution time: 41870.832 ms

Running ANALYZE makes no difference.

Table pa has 7522676 rows, 4834042 of which have field1 NULL, so it's
absolutely not reasonable to expect this to be an optimal strategy.

Any suggestions as to how I can improve this query?

Thanks :)

Geoff

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Mat Arye 2016-03-28 20:41:00 Table size for partitioned setup
Previous Message Adrian Klaver 2016-03-28 13:45:03 Re: View deleted records in a table