index being ignored for "limit n" queries

From: Piotr Sulecki <Piotr(dot)Sulecki(at)sybilla(dot)traxelektronik(dot)pl>
To: pgsql-general(at)postgresql(dot)org
Subject: index being ignored for "limit n" queries
Date: 2005-08-09 12:12:49
Message-ID: 42F89DC1.9040303@sybilla.traxelektronik.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Ave!

Yesterday I noticed a problem with my PostgreSQL installation. I have
three database clusters, one using version 7.4.6 and the rest using
version 8.0.1. The problem manifests itself in all three installations.
(The three databases are copies of -- more or less -- the same data; now
I'm writing a program to really synchronize these databases, but that's
besides the point. I'll only describe one of them; the same can be said
of the others, too.)

I have two really big tables, the problem is with one of them. The table
looks as follows:

Table "public.pakiety"
Column | Type | Modifiers
-------------+-----------------------------+---------------------------------------------
pktid | integer | not null default
nextval('pktid_seq')
stid | smallint | not null
received | timestamp(6) with time zone | not null
measured | timestamp(0) with time zone | not null
station_time | timestamp(0) with time zone |
bezwzgl | smallint |
full_cycle | boolean |

Indexes:
"pakiety_pkey" PRIMARY KEY, btree (pktid)
"pakiety_stid_received_idx" UNIQUE, btree (stid, received)
"pakiety_measured_idx" btree (measured)
"pakiety_received_idx" btree (received)
"pakiety_stid_measured_idx" btree (stid, measured)

This table holds info about data packets sent by our automated meteo
stations.

The table contains about 15 million rows. Any type of a sequential scan
will be extremely slow on this table. Even an index scan will take
minutes if the index is not fit for the search.

What's unusual is that:
1) there is only a very limited set of used stid (station ID) values -
about 500 or so, and
2) there can be some very long periods of time (measured in months) when
a specific station doesn't send any packets at all.

That's why I created indices pakiety_stid_received_idx and
pakiety_stid_measured_idx. Unfortunately, when I'm looking for the last
data packet from a specific station, I get an index scan using only the
"pakiety_received_idx" index and not the "pakiety_stid_received_idx"
one. While it takes a split second for a recently active station, it
really takes several minutes for a station which has been quiet for
several months.

trax=# explain select * from pakiety where stid = 234::smallint order by
received desc limit 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..6.26 rows=1 width=33)
-> Index Scan Backward using pakiety_received_idx on pakiety
(cost=0.00..193599.37 rows=30915 width=33)
Filter: (stid = 234::smallint)
(3 rows)

What's funny, I get the same plan when using "limit 10000":

trax=# explain select * from pakiety where stid = 234::smallint order by
received desc limit 10000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..62621.32 rows=10000 width=33)
-> Index Scan Backward using pakiety_received_idx on pakiety
(cost=0.00..193756.63 rows=30941 width=33)
Filter: (stid = 234::smallint)
(3 rows)

I have just noticed another problem: when the limit goes even higher,
the planner decides to use another unfit index.

trax=# explain select * from pakiety where stid = 234::smallint order by
received desc limit 100000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Limit (cost=125101.46..125178.81 rows=30940 width=33)
-> Sort (cost=125101.46..125178.81 rows=30940 width=33)
Sort Key: received
-> Index Scan using pakiety_stid_measured_idx on pakiety
(cost=0.00..122288.52 rows=30940 width=33)
Index Cond: (stid = 234::smallint)
(5 rows)

The same index is also used if I remove the "limit n" clause entirely:

trax=# explain select * from pakiety where stid = 234::smallint order by
received desc;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Sort (cost=125293.11..125370.58 rows=30986 width=33)
Sort Key: received
-> Index Scan using pakiety_stid_measured_idx on pakiety
(cost=0.00..122474.14 rows=30986 width=33)
Index Cond: (stid = 234::smallint)
(4 rows)

The "pakiety_stid_received_idx" index never gets used. Any hints about
possible reasons of such behavior?

Regards,

Piotr Sulecki.

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alban Hertroys 2005-08-09 12:31:16 Referencing "less-unique" foreign keys
Previous Message Richard Huxton 2005-08-09 11:38:52 Re: Case sensitivity