Re: have: seq scan - want: index scan

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Chris Ruprecht <chris(at)cdrbill(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: have: seq scan - want: index scan
Date: 2012-10-17 09:08:15
Message-ID: CAEV0TzDtiE4yre0fYK5r1feWPCPigsX03N_fZbfpfsMiaK6Daw@mail.gmail.com
Lists: pgsql-performance

On Tue, Oct 16, 2012 at 4:45 PM, Chris Ruprecht <chris(at)cdrbill(dot)com> wrote:

> Hi guys,
>
> PG = 9.1.5
> OS = winDOS 2008R8
>
> I have a table that currently has 207 million rows.
> there is a timestamp field that contains data.
> more data gets copied from another database into this database.
> I tested different indexes - same behavior.
>
>
> The query plan says:
>
> " -> Seq Scan on audittrailclinical (cost=0.00..8637598.76
> rows=203856829 width=62)"
> " Filter: (("timestamp" >= '2008-01-01
> 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2012-10-05
> 00:00:00'::timestamp without time zone))"
>
> which takes forever.
>
> How do I make this do an index scan instead?
> I did an "analyze audittrailclinical" to no avail.
>

analyze says 203 million out of 207 million rows are matched by your
timestamp filter, so the planner is definitely going to favour a sequential
scan. An index scan that matches that many rows will inevitably be slower
than simply scanning the table: it has to do the index lookups and also
load the actual records from the table (essentially all of them) in order
to determine their visibility to you, so an index scan would just turn
sequential access of the table pages into random access while requiring
the index lookups as well. You can verify this by setting enable_seqscan
to false and running your explain again to see how the plan changes,
though that setting doesn't eliminate sequential scans entirely, it just
reduces their likelihood, so you may find that nothing changes. If the
estimate for the number of matching rows is incorrect, you'll want to
increase the statistics gathering for that table or just that column.
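A minimal way to run that check in a psql session might look like this
(the count(*) query is just a stand-in for your real query - substitute
your own):

```sql
-- Discourage sequential scans for this session only, then re-plan.
SET enable_seqscan = off;

EXPLAIN
SELECT count(*)
FROM audittrailclinical
WHERE "timestamp" >= '2008-01-01 00:00:00'
  AND "timestamp" <= '2012-10-05 00:00:00';

-- Restore the default afterwards.
RESET enable_seqscan;
```

If the plan still shows a seq scan (or the index plan's estimated cost is
far higher), the planner's preference is probably correct for this filter.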

ALTER TABLE <table> ALTER COLUMN <column> SET STATISTICS <number>

where number is between 10 and 1000; I believe the default is 100. Then
re-analyze the table and see whether the query plan shows better estimates.
9.2 also supports "index-only scans", which eliminate the need to load the
matched records in certain circumstances. However, all of the columns used
by the query would need to be in the index, and you are using an awful lot
of columns between the select clause and the table joins.
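For your table, that could look like the following (the statistics target
of 1000 is just an aggressive example value, not a recommendation for your
workload):

```sql
-- Gather much finer-grained statistics for the timestamp column,
-- then refresh the planner's view of the table.
ALTER TABLE audittrailclinical
    ALTER COLUMN "timestamp" SET STATISTICS 1000;
ANALYZE audittrailclinical;
```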

Are you lacking indexes on the columns used for joins that would allow more
selective index scans on those columns, which could then just filter by
timestamp? I'm not much of an expert on the query planner, so I'm not sure
exactly what causes that behaviour, but I'd think that good statistics and
useful indexes should allow the rest of the where clause to be more
selective of the rows from audittrailclinical, unless
patientaccount.defaultaccount = 'Y' and patient.dnsortpersonnumber =
'347450' are similarly non-selective. patient.dnsortpersonnumber would seem
to be the strong filter, though, so make sure you've got indexes and
accurate stats on all of the foreign keys that connect the patient table to
the audittrailclinical table. It'd be useful to see the rest of the explain
analyze output so we could see how it is handling the joins and why.

Note that because you have multiple composite foreign keys joining tables
in your query, you almost certainly won't have those composite keys covered
by a single index. If you have indexes on those columns but they are
single-column indexes, that may be what is causing the planner to filter
the atc table on the timestamp rather than via the joins.

I'm sure someone more knowledgeable than I will be along eventually to
correct any misinformation I may have passed along. Without knowing
anything about your schema or the rest of the explain analyze output, I'm
mostly just guessing. There is an entire page devoted to formulating useful
mailing list questions, incidentally; yours really isn't one. But if the
atc table definition you posted is complete, you are definitely missing
potentially useful indexes: you are joining to that table via encountersid,
and you don't show an index on that column - yet that is the column that
eventually joins out to the patient and patientaccount tables, which have
the stronger filters on them.
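If the join really does come in through encountersid, a composite index
that leads with the join column could let the planner join first and then
range-filter by time cheaply. A sketch - the index name and column order
are my guess, not taken from your schema:

```sql
-- Hypothetical composite index: join column first, timestamp second,
-- so an encounter's rows can be located and range-filtered in one pass.
CREATE INDEX atc_encountersid_ts_idx
    ON audittrailclinical (encountersid, "timestamp");
ANALYZE audittrailclinical;
```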

Incidentally, why the join to the entity table via entitysid? No columns
from that table appear to be used anywhere else in the query.

--sam
