Re: Seq scan on big table, episode 2

From: Vincenzo Melandri <vmelandri(at)imolinfo(dot)it>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Seq scan on big table, episode 2
Date: 2012-10-31 11:46:59
Message-ID: CAHSd9Gdxi7HX4i2uXdV2WJ5vYPhC6bc1gCPayJbHhFnMFKmgEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I may (or may not) have found the solution: a reindex on the 3 tables fixed
the query plan. Now I can plan to reindex only the involved indexes at the
start of the data import procedure.

On Wed, Oct 31, 2012 at 11:55 AM, Vincenzo Melandri
<vmelandri(at)imolinfo(dot)it>wrote:

> Hi all :)
>
> I'm here again.
> This time I'll provide more details (explain analyze, data-type, and
> indexes), hope it will be enough :)
>
> The query that is performing a plan that i do not understand is the
> following:
> --------------------
> select [some fields from all 3 tables]
> from
> DATA_SEQUENCES
> join SUBSCRIPTION on
> SUBSCRIPTION.key1 = DATA_SEQUENCES.key1 AND
> SUBSCRIPTION.key2 = DATA_SEQUENCES.key2
> join people on
> people.key1 = SUBSCRIPTION.people_key1 AND
> people.key2 = SUBSCRIPTION.people_key2
> WHERE DATA_SEQUENCES.import_id = 1351674661
> --------------------
>
> This is the explain analyze:
>
> --------------------
> Merge Join (cost=2902927.01..2973307.79 rows=790371 width=240) (actual
> time=40525.439..40525.439 rows=0 loops=1)
> Merge Cond: ((people.key1 = subscription.people_key1) AND (people.key2 =
> subscription.people_key2))
> -> Sort (cost=2885618.73..2904468.49 rows=7539905 width=240) (actual
> time=40525.268..40525.268 rows=1 loops=1)
> Sort Key: people.key1, people.key2
> Sort Method: external merge Disk: 466528kB
> -> Seq Scan on people (cost=0.00..323429.05 rows=7539905
> width=240) (actual time=0.029..5193.057 rows=7539469 loops=1)
> -> Sort (cost=17308.28..17318.76 rows=4193 width=16) (actual
> time=0.167..0.167 rows=0 loops=1)
> Sort Key: subscription.people_key1, subscription.people_key2
> Sort Method: quicksort Memory: 25kB
> -> Nested Loop (cost=0.00..17055.99 rows=4193 width=16) (actual
> time=0.154..0.154 rows=0 loops=1)
> -> Seq Scan on data_sequences (cost=0.00..150.15 rows=39
> width=16) (actual time=0.154..0.154 rows=0 loops=1)
> Filter: (import_id = 1351674661)
> -> Index Scan using xpksubscription on subscription
> (cost=0.00..431.86 rows=108 width=16) (never executed)
> Index Cond: ((subscription.key1 = data_sequences.key1)
> AND (subscription.key2 = data_sequences.key2))
> Total runtime: 40600.815 ms
> --------------------
>
> All the key, key2, and relative foreign keys are int4. Import_id is a
> bigint.
> I'm not reporting the full create table script 'cause people and
> subscription both have lots of fields. I know this can be wrong (lots of
> field on big table), but this is an environment born something like 20
> years ago and not intended from the start for such a big data volume.
>
> I have the following indexes:
>
> on People:
> CREATE UNIQUE INDEX people_pkey ON people USING btree (key1, key2)
> CREATE INDEX people_pkey_hash_loc ON people USING hash (key1);
> CREATE INDEX people_pkey_hash_id ON people USING hash (key2);
>
> on Subscription:
> CREATE UNIQUE INDEX subscription_pkey ON subscription USING btree (key1,
> key2)
> CREATE INDEX subscription_fk_people ON subscription USING btree
> (people_key1, people_key2)
>
> on Data_sequences:
> create index data_sequences_key on data_sequences USING btree (key1, key2);
> create index data_sequences_id on data_sequences USING btree (import_id);
>
> What i don't understand is WHY the seq scan on people, and how can I cast
> the import_id to make it use the index on data_sequences (another useless
> seq scan).
> Mind that when I run this explain analyze there were no records on
> data_sequences. So all the time (40 seconds!) is for the useless seq scan
> on people. Both people and subscription have lots of records (10-20.000.000
> range).
> I'm running 8.4 (haven't tested it on 9.2.1 yet, but we planned to upgrade
> ASAP cause we have other queries which will benefit from the
> index-only-scan new feature).
>
>
> Thank you in advance,
> --
> Vincenzo.
>

--
Vincenzo.
Imola Informatica

Ai sensi del D.Lgs. 196/2003 si precisa che le informazioni contenute in
questo messaggio sono riservate ed a uso esclusivo del destinatario.
Pursuant to Legislative Decree No. 196/2003, you are hereby informed that
this message contains confidential information intended only for the use of
the addressee.

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Shaun Thomas 2012-10-31 13:04:55 Re: Seq scan on big table, episode 2
Previous Message AndyG 2012-10-31 11:28:12 Re: Slow query, where am I going wrong?