Re: INSERT/UPDATE statements sometimes choosing not to use PK index?

From: Matt <bsg075(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: INSERT/UPDATE statements sometimes choosing not to use PK index?
Date: 2013-10-29 14:37:36
Message-ID: CAF=YoALaNPPAej4YYaWbvdMCcM8XMJWATBFP_+dEtdg9pmfepQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Since this is an upsert, the staging table simply accepts a copy of
pipe-delimited data via a COPY FROM STDIN, and a subsequent SQL script
performs the UPDATE/INSERT. The staging table is then truncated (or delete
without where) for the next run: Truncate staging, COPY into staging,
update then insert destination from staging, repeat on next 5 minute clock
cycle.

Since the staging table is fully loaded into the destination table, a full
scan was not a concern, but the lack of index usage on the destination. For
example, this is normally supported by the PK index, and performs well:

INSERT INTO destination (…)
SELECT (…) FROM staging
WHERE NOT EXISTS (SELECT * FROM source WHERE source.id = destination.id)

The above runs in expected time when the explain plan shows an index scan
on the destination primary key index, and a seq scan on the staging table.
This will continue for many runs, until something causes the engine to stop
using the destination PK index, and begin scanning both the destination and
staging. What triggers this is the interesting question.The only system
event I can (loosely) correlate with the problem is the start of a nightly
pg_dump, but as a read-only process, this would not be changing any index
stats, correct?

An ANALYZE on staging is possible after each load cycle, but time does not
permit on the destination table. I have been focusing on the destination
because it has the index that is not being used. Will stats on the staging
table affect index selection on the destination in a case like this?

In the process of attempting to diagnose this, both tables involved have
been vacuumed (full), analyzed. I have also moved staging to an SSD volume,
and created an equivalent index on staging - which is not used in any plan,
nor do I expect to as there is no filter criteria on staging, and the index
maintenance on staging would seem to be an unneeded overhead. But in this
case, is there any advantage to an index on staging?

For reference, it is possible (not enough test cycles to verify) that left
anti-join makes this situation worse, even though the explain plans appear
identical:

INSERT INTO destination (…)
SELECT (…) FROM staging
LEFT JOIN destination ON destination.id = staging.id
WHERE destination.id IS NULL

On 29 Oct 2013, at 9:45, Tom Lane wrote:

> Matt <bsg075(at)gmail(dot)com> writes:
>> In most cases, EXPLAIN and runtime tell me the index is utilized.
However,
>> sometime on back to back runs (5 minute intervals) the runtime explodes
and
>> EXPLAIN shows me that the PK index is not used, as both the staging table
>> and destination table a sequentially scanned.
>
> You haven't explained what you do with the "staging table", but maybe you
> need to ANALYZE it after you've loaded fresh data into it. The stats for
> the main table are presumably somewhat stable, but the stats for the
> staging table maybe not. The apparent correlation to consecutive runs
> lends some support to this theory, as that would reduce the time window
> for auto-ANALYZE to fix it for you.
>
> regards, tom lane

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Kevin Grittner 2013-10-29 15:40:37 Re: Replication and fsync
Previous Message Merlin Moncure 2013-10-29 14:06:22 Re: json datatype and table bloat?