From: Miguel Miranda <miguel(dot)mirandag(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: delete duplicates takes too long
Date: 2009-04-24 23:37:22
Message-ID: aa6b08350904241637n4aabb598q88b00b4d51895010@mail.gmail.com
Lists: pgsql-general
Hi, I have a table:
CREATE TABLE public.cdr_ama_stat (
    id int4 NOT NULL DEFAULT nextval('cdr_ama_stat_id_seq'::regclass),
    abonado_a varchar(30) NULL,
    abonado_b varchar(30) NULL,
    fecha_llamada timestamp NULL,
    duracion int4 NULL,
    puerto_a varchar(4) NULL,
    puerto_b varchar(4) NULL,
    tipo_llamada char(1) NULL,
    processed int4 NULL,
    PRIMARY KEY(id)
);

CREATE INDEX kpi_fecha_llamada
    ON public.cdr_ama_stat(fecha_llamada);
There should be unique values for abonado_a, abonado_b, fecha_llamada and
duracion in every row. Googling around, I found how to delete duplicates on
the postgresonline site, so I ran the following query (let's say I want to
know how many duplicates exist for 2009-04-18 before deleting them):
SELECT * FROM cdr_ama_stat
WHERE id NOT IN
      (SELECT MAX(dt.id)
       FROM cdr_ama_stat AS dt
       WHERE dt.fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day'
       GROUP BY dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion)
  AND fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day';
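(For reference, the eventual cleanup step would be the same pattern with DELETE
instead of SELECT, roughly along these lines:)

-- sketch of the intended cleanup: keep only the highest id per
-- (abonado_a, abonado_b, fecha_llamada, duracion) group for that day
DELETE FROM cdr_ama_stat
WHERE id NOT IN
      (SELECT MAX(dt.id)
       FROM cdr_ama_stat AS dt
       WHERE dt.fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day'
       GROUP BY dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion)
  AND fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day';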
My problem is that the SELECT query takes forever. Number of rows:
kpi=# select count(*) from cdr_ama_stat;
count
---------
5908065
(1 row)
This is the EXPLAIN result for the above query:
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Index Scan using kpi_fecha_llamada on cdr_ama_stat  (cost=115713.94..79528582.40 rows=140809 width=50)
   Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
   Filter: (NOT (subplan))
   SubPlan
     ->  Materialize  (cost=115713.94..116202.56 rows=28162 width=34)
           ->  GroupAggregate  (cost=110902.49..115478.78 rows=28162 width=34)
                 ->  Sort  (cost=110902.49..111606.53 rows=281618 width=34)
                       Sort Key: dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion
                       ->  Bitmap Heap Scan on cdr_ama_stat dt  (cost=8580.03..70970.30 rows=281618 width=34)
                             Recheck Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
                             ->  Bitmap Index Scan on kpi_fecha_llamada  (cost=0.00..8509.62 rows=281618 width=0)
                                   Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
Am I doing something wrong? I think several minutes should be OK, but not
several hours as happens now. I do a bulk load (approx. 350k rows) into that
table every day, and yes, I did VACUUM FULL ANALYZE the database before
posting these results.
The server is a dual Xeon 3.0 GHz with 2 GB RAM.
Best regards