From: "Shea,Dan [CIS]" <Dan(dot)Shea(at)ec(dot)gc(dot)ca>
To: "Shea,Dan [CIS]" <Dan(dot)Shea(at)ec(dot)gc(dot)ca>, Postgres Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Deleting certain duplicates
Date: 2004-04-12 15:18:31
Message-ID: F2D63B916C88C14D9B59F93C2A5DD33F0B921A@cisxa.cmc.int.ec.gc.ca
Lists: pgsql-performance
The index is
Indexes:
"forecastelement_rwv_idx" btree (region_id, wx_element, valid_time)
-----Original Message-----
From: Shea,Dan [CIS] [mailto:Dan(dot)Shea(at)ec(dot)gc(dot)ca]
Sent: Monday, April 12, 2004 10:39 AM
To: Postgres Performance
Subject: [PERFORM] Deleting certain duplicates
We have a large database which recently increased dramatically in size due to a
change in our insert program that let all entries through.
PWFPM_DEV=# select relname,relfilenode,reltuples from pg_class where relname
= 'forecastelement';
     relname     | relfilenode |  reltuples
-----------------+-------------+-------------
 forecastelement |   361747866 | 4.70567e+08
     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 version        | character varying(99)       |
 origin         | character varying(10)       |
 timezone       | character varying(99)       |
 region_id      | character varying(20)       |
 wx_element     | character varying(99)       |
 value          | character varying(99)       |
 flag           | character(3)                |
 units          | character varying(99)       |
 valid_time     | timestamp without time zone |
 issue_time     | timestamp without time zone |
 next_forecast  | timestamp without time zone |
 reception_time | timestamp without time zone |
The program is supposed to use a select statement to check that all fields
except reception_time are unique and, if so, insert the row.
Due to an error in a change, reception_time was included in the select that
checks for duplicates. The reception_time is set by the program that creates
the dat file to be inserted.
Essentially, this let all duplicates be inserted.
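For illustration, a minimal sketch of the intended check, using the column
names from the table definition above (the $n placeholders for the incoming
row's values are an assumption about how the insert program binds parameters):

  select 1
    from forecastelement
   where version = $1 and origin = $2 and timezone = $3
     and region_id = $4 and wx_element = $5 and value = $6
     and flag = $7 and units = $8 and valid_time = $9
     and issue_time = $10 and next_forecast = $11;
  -- reception_time is deliberately excluded; insert only if no row is returned.
  -- (Columns that can be NULL would need NULL-safe comparisons.)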
I tried the delete query below.
PWFPM_DEV=# delete from forecastelement where oid not in
  (select min(oid) from forecastelement group by version, origin, timezone,
   region_id, wx_element, value, flag, units, valid_time, issue_time,
   next_forecast);
It ran for 3 days, creating in pgsql_tmp what I assume is an index for the
group by.
The query ended up failing with "dateERROR:write failed".
Well, the long weekend is over and we do not have the luxury of trying this
again.
So I was thinking maybe of doing the deletion in chunks, perhaps based on
reception time.
Are there any suggestions for a better way to do this, or for using multiple
queries to delete selectively, a week at a time, based on reception_time?
I would say there are a lot of duplicate entries between mid-March and the
first week of April.
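For example, one week-at-a-time pass could look roughly like the sketch below.
The dates are only illustrative, and like the min(oid) query above it assumes
the lowest-oid row in each duplicate group is the one to keep:

  delete from forecastelement
   where reception_time >= '2004-03-15'
     and reception_time <  '2004-03-22'
     and exists (select 1
                   from forecastelement g
                  where g.version = forecastelement.version
                    and g.origin = forecastelement.origin
                    and g.timezone = forecastelement.timezone
                    and g.region_id = forecastelement.region_id
                    and g.wx_element = forecastelement.wx_element
                    and g.value = forecastelement.value
                    and g.flag = forecastelement.flag
                    and g.units = forecastelement.units
                    and g.valid_time = forecastelement.valid_time
                    and g.issue_time = forecastelement.issue_time
                    and g.next_forecast = forecastelement.next_forecast
                    and g.oid < forecastelement.oid);
  -- Repeat with the next week's reception_time range until the whole
  -- mid-March to early-April span has been covered.

One caveat with this sketch: the "=" comparisons will not treat two NULLs as
equal, unlike the group by in the original query.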