From: | "Strange, John W" <john(dot)w(dot)strange(at)jpmorgan(dot)com> |
---|---|
To: | "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org> |
Subject: | Re: Duplicate deletion optimizations |
Date: | 2012-01-07 00:02:01 |
Message-ID: | D86EB8F058615F40948C622359204B3F05460A2862@EMASC201VS01.exchad.jpmchase.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Are your stats updated on the table after you added the index?
- run the bad query with explain verbose on (you should send this anyways)
- check to see what the difference is in expected rows vs. actual rows
- make sure that your work_mem is high enough if you are sorting, if not you'll see it write out a temp file which will be slow.
- if there is different analyze the table and rerun the query to see if you get the expected results.
- I do believe having COUNT(*) > 1 will never use an index, but someone more experience can comment here.
-----Original Message-----
From: pgsql-performance-owner(at)postgresql(dot)org [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of antoine(at)inaps(dot)org
Sent: Friday, January 06, 2012 8:36 AM
To: pgsql-performance(at)postgresql(dot)org
Subject: [PERFORM] Duplicate deletion optimizations
Hello,
I've a table with approximately 50 million rows with a schema like
this:
id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
t_value integer NOT NULL DEFAULT 0,
t_record integer NOT NULL DEFAULT 0,
output_id integer NOT NULL DEFAULT 0,
count bigint NOT NULL DEFAULT 0,
CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
Every 5 minutes, a process have to insert a few thousand of rows in this table, but sometime, the process have to insert an already existing row (based on values in the triplet (t_value, t_record, output_id). In this case, the row must be updated with the new count value. I've tried some solution given on this stackoverflow question [1] but the insertion rate is always too low for my needs.
So, I've decided to do it in two times:
- I insert all my new data with a COPY command
- When it's done, I run a delete query to remove oldest duplicates
Right now, my delete query look like this:
SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;
The duration of the query on my test machine with approx. 16 million rows is ~18s.
To reduce this duration, I've tried to add an index on my triplet:
CREATE INDEX test
ON stats_5mn
USING btree
(t_value , t_record , output_id );
By default, the PostgreSQL planner doesn't want to use my index and do a sequential scan [2], but if I force it with "SET enable_seqscan = off", the index is used [3] and query duration is lowered to ~5s.
My questions:
- Why the planner refuse to use my index?
- Is there a better method for my problem?
Thanks by advance for your help,
Antoine Millet.
[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql
http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different
[2] http://explain.depesz.com/s/UzW :
GroupAggregate (cost=1167282.380..1294947.770 rows=762182
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
Filter: (five(*) > 1)
-> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1)
Sort Key: delta, kilo, four
Sort Method: external merge Disk: 223512kB
-> Seq Scan on three (cost=0.000..139734.140 rows=7621814
width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)
[3] http://explain.depesz.com/s/o9P :
GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1)
Filter: (five(*) > 1)
-> Index Scan using charlie on three (cost=0.000..11422738.330
rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805
loops=1)
--
Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.
From | Date | Subject | |
---|---|---|---|
Next Message | Misa Simic | 2012-01-07 03:28:06 | Re: Duplicate deletion optimizations |
Previous Message | Marc Eberhard | 2012-01-06 22:20:35 | Re: Duplicate deletion optimizations |