From: Misa Simic <misa(dot)simic(at)gmail(dot)com>
To: "antoine(at)inaps(dot)org" <antoine(at)inaps(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Duplicate deletion optimizations
Date: 2012-01-07 03:28:06
Message-ID: -8455307354303926221@unknownmsgid
Lists: pgsql-performance
Hi,

Maybe these thoughts could help:
1) An ORDER BY on those three columns in your SELECT min() query could
encourage index usage (see the sketch after point 2).

2) Or delete the duplicates directly with an EXISTS subquery:

-- keeps the row with the highest id in each duplicate group
-- (generic table/column names; substitute your own)
DELETE FROM mytable m
WHERE EXISTS (SELECT 1 FROM mytable t
              WHERE t.id > m.id AND t.col1 = m.col1
                AND t.col2 = m.col2 AND t.col3 = m.col3);
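For point 1, with your actual column names, the reordered query might look
like this (a sketch only; whether the planner actually switches to the
index depends on its cost estimates):

SELECT min(id)
FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1
ORDER BY t_value, t_record, output_id;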
Sent from my Windows Phone
From: antoine(at)inaps(dot)org
Sent: 06/01/2012 15:36
To: pgsql-performance(at)postgresql(dot)org
Subject: [PERFORM] Duplicate deletion optimizations
Hello,
I have a table with approximately 50 million rows and a schema like this:

CREATE TABLE stats_5mn (
    id        bigint  NOT NULL DEFAULT nextval('stats_5mn'::regclass),
    t_value   integer NOT NULL DEFAULT 0,
    t_record  integer NOT NULL DEFAULT 0,
    output_id integer NOT NULL DEFAULT 0,
    count     bigint  NOT NULL DEFAULT 0,
    CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
);
Every 5 minutes, a process has to insert a few thousand rows into this
table, but sometimes it has to insert an already existing row (based on
the values of the triplet (t_value, t_record, output_id)). In this case,
the row must be updated with the new count value. I've tried some of the
solutions given in this stackoverflow question [1], but the insertion
rate is always too low for my needs.
So, I've decided to do it in two steps:
- I insert all my new data with a COPY command
- When it's done, I run a delete query to remove the oldest duplicates
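A sketch of that first step, assuming the new rows arrive as a CSV file
(the file path and column list here are hypothetical; id comes from the
sequence default):

COPY stats_5mn (t_value, t_record, output_id, count)
FROM '/path/to/new_rows.csv' WITH (FORMAT csv);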
Right now, my delete query looks like this (the SELECT below finds the
oldest id in each group of duplicates):
SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;
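A sketch of how that SELECT might be wrapped into the actual DELETE (the
outer statement is an assumption, since only the inner query is shown
here):

DELETE FROM stats_5mn
WHERE id IN (SELECT min(id)
             FROM stats_5mn
             GROUP BY t_value, t_record, output_id
             HAVING count(*) > 1);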
The duration of the query on my test machine with approx. 16 million
rows is ~18s.
To reduce this duration, I've added an index on my triplet:

CREATE INDEX test
    ON stats_5mn
    USING btree (t_value, t_record, output_id);
By default, the PostgreSQL planner doesn't want to use my index and does
a sequential scan [2], but if I force it with "SET enable_seqscan = off",
the index is used [3] and the query duration drops to ~5s.
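For reference, the forcing step described above is scoped to the session
like this (RESET restores the default afterwards):

SET enable_seqscan = off;
SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;
RESET enable_seqscan;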
My questions:
- Why does the planner refuse to use my index?
- Is there a better method for my problem?
Thanks in advance for your help,
Antoine Millet.
[1] http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql
[2] http://explain.depesz.com/s/UzW (identifiers anonymized by
explain.depesz.com; e.g. "five(*)" is count(*) and "three" is the table):

GroupAggregate  (cost=1167282.380..1294947.770 rows=762182 width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
  Filter: (five(*) > 1)
  ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1)
        Sort Key: delta, kilo, four
        Sort Method: external merge  Disk: 223512kB
        ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)
[3] http://explain.depesz.com/s/o9P :

GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1)
  Filter: (five(*) > 1)
  ->  Index Scan using charlie on three  (cost=0.000..11422738.330 rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1)