Re: select distinct and index usage

From: "David Wilson" <david(dot)t(dot)wilson(at)gmail(dot)com>
To: "Alban Hertroys" <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: select distinct and index usage
Date: 2008-04-07 18:27:42
Message-ID: e7f9235d0804071127n5cd08096qec8653c905550f1e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, Apr 7, 2008 at 1:42 PM, Alban Hertroys
<dalroi(at)solfertje(dot)student(dot)utwente(dot)nl> wrote:
>
> Have you tried this query with enable_seqscan=off? If my guess is right
> (and the planners, in that case) it'd be even slower.

set enable_seqscan=off;
explain select distinct datestamp from vals;
QUERY PLAN
----------------------------------------------------------------------------------------------
Unique (cost=115003047.47..115380004.83 rows=4263 width=4)
-> Sort (cost=115003047.47..115191526.15 rows=75391472 width=4)
Sort Key: datestamp
-> Seq Scan on vals (cost=100000000.00..101531261.72
rows=75391472 width=4)

It appears to be doing a sequential scan regardless of the set, as if
it doesn't believe it can use the index for some reason
>
> Something that might help you, but I'm not sure whether it might hurt the
> performance of other queries, is to cluster that table on val_datestamp_idx.
> That way the records are already (mostly) sorted on disk in the order of the
> datestamps, which seems to be the brunt of above query plan.

That's a good thought. I'll give that a try this evening when the DB
has some downtime and see what happens.

>
> There seems to be quite a bit of overlap in your index definitions. From my
> experience this can confuse the planner.
>
> I suggest you combine them, but not knowing your data... Maybe rewriting
> your UNIQUE constraint to (val_dur, datestamp, eid, sid) would be enough to
> replace all those other indexes.
> If not, it's probably better to have one index per column, so that the
> planner is free to combine them as it sees fit. That'd result in a bitmap
> index scan, btw.

I can take a look at the other indices again, but those are all in
place for specific other queries that generally involve some set of
a=1, b=2, c=3, datestamp>5 type of where-clause and were created
specifically in response to sequential scans showing up in other
queries (and had the proper effect of fixing them!)
>
> I'm not a postgresql tuning expert (I know my way around though), other
> people can explain you way better than I can. Bruce Momjian for example:
> http://www.linuxjournal.com/article/4791

I'll take a look at that, thanks.

> That calculation doesn't look familiar to me, I'm more used to:
> select pg_size_pretty(pg_relation_size('...'));
>
> You can put the name of any relation in there, be it tables, indexes, etc.
>
> 11GB is pretty large for an index on an integer column, especially with
> only 75M rows: that's 146 bytes/row in your index. Maybe your index got
> bloated somehow? I think it should be about a tenth of that.

pg_total_relation_size('..') gives the number of bytes for the table +
all associated indices; pg_relation_size('..') gives for just the
table. The difference between the two should be total bytes take up by
the 5 total indices (11 total index cols), giving a
back-of-the-envelope estimation of 1gb for the size of the datestamp
index. I am fairly certain that I didn't give pg 1gb to fit the index
in memory, so I'll try upping its total available memory tonight and
see if that doesn't improve things.

I appreciate the responses so far! I'm used to several minutes for
some of the complex queries on this DB, but 12.5 minutes for a select
distinct just seems wrong. :)

--
- David T. Wilson
Princeton Satellite Systems
david(dot)t(dot)wilson(at)gmail(dot)com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Bruce Momjian 2008-04-07 18:31:13 Re: Subtracting Two Intervals
Previous Message Terry Lee Tucker 2008-04-07 18:11:26 Subtracting Two Intervals