From:       "Shea,Dan [CIS]" <Dan(dot)Shea(at)ec(dot)gc(dot)ca>
To:         'Rajesh Kumar Mallah' <mallah(at)trade-india(dot)com>
Cc:         Postgres Performance <pgsql-performance(at)postgresql(dot)org>
Subject:    Re: Deleting certain duplicates
Date:       2004-04-16 17:52:15
Message-ID: F2D63B916C88C14D9B59F93C2A5DD33F0B9234@cisxa.cmc.int.ec.gc.ca
Lists:      pgsql-performance
Thanks Mallah,
I will keep this example in case I need it again in the future.
Unfortunately, I do not have enough free space at the moment to create a
temp table.
Dan
-----Original Message-----
From: Rajesh Kumar Mallah [mailto:mallah(at)trade-india(dot)com]
Sent: Tuesday, April 13, 2004 10:27 AM
To: Shea,Dan [CIS]
Cc: Postgres Performance
Subject: Re: [PERFORM] Deleting certain duplicates
Shea,Dan [CIS] wrote:
>The index is
>Indexes:
> "forecastelement_rwv_idx" btree (region_id, wx_element, valid_time)
>
>-----Original Message-----
>From: Shea,Dan [CIS] [mailto:Dan(dot)Shea(at)ec(dot)gc(dot)ca]
>Sent: Monday, April 12, 2004 10:39 AM
>To: Postgres Performance
>Subject: [PERFORM] Deleting certain duplicates
>
>
>We have a large database which recently increased dramatically due to a
>change in our insert program allowing all entries.
>PWFPM_DEV=# select relname,relfilenode,reltuples from pg_class
>            where relname = 'forecastelement';
>     relname     | relfilenode |  reltuples
>-----------------+-------------+-------------
> forecastelement |   361747866 | 4.70567e+08
>
> Column | Type | Modifiers
>----------------+-----------------------------+-----------
> version | character varying(99) |
> origin | character varying(10) |
> timezone | character varying(99) |
> region_id | character varying(20) |
> wx_element | character varying(99) |
> value | character varying(99) |
> flag | character(3) |
> units | character varying(99) |
> valid_time | timestamp without time zone |
> issue_time | timestamp without time zone |
> next_forecast | timestamp without time zone |
> reception_time | timestamp without time zone |
>
>The program is supposed to check, via a select statement, that all fields
>except reception_time are unique, and insert the row only if so.
>Due to an error in a change, reception_time was included in the select
>that checks for duplicates. The reception_time is set by the program that
>creates the dat file to insert, so it differs for every file.
>This essentially allowed all duplicate rows to be inserted.
>
>I tried the delete query below.
>PWFPM_DEV=# delete from forecastelement where oid not in
>    (select min(oid) from forecastelement
>     group by version, origin, timezone, region_id, wx_element, value,
>              flag, units, valid_time, issue_time, next_forecast);
>It ran for 3 days, creating what I assume are temporary files for the
>group by in pgsql_tmp.
>The query eventually failed with "ERROR: write failed".
>Well, the long weekend is over and we do not have the luxury of trying
>this again.
>So I was thinking of doing the deletion in chunks, perhaps based on
>reception_time.
>
>
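The keep-one-row pattern in the quoted delete can be sketched as a small runnable example. This uses Python's sqlite3 with SQLite's implicit rowid standing in for PostgreSQL's oid; the toy columns and data are invented for illustration, not taken from the real forecastelement table:

```python
import sqlite3

# Toy table standing in for forecastelement; rowid plays the role of oid.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE forecastelement (region_id TEXT, wx_element TEXT, "
            "value TEXT, reception_time TEXT)")
con.executemany("INSERT INTO forecastelement VALUES (?,?,?,?)", [
    ("ON", "TEMP", "12", "2004-03-15 00:00"),  # original row
    ("ON", "TEMP", "12", "2004-03-15 00:05"),  # duplicate: differs only in reception_time
    ("QC", "WIND", "30", "2004-03-16 00:00"),
])

# Keep the lowest rowid per duplicate group and delete the rest -- the same
# shape as the NOT IN (SELECT min(oid) ... GROUP BY ...) query above.
con.execute("""
    DELETE FROM forecastelement
    WHERE rowid NOT IN (
        SELECT min(rowid) FROM forecastelement
        GROUP BY region_id, wx_element, value
    )
""")
remaining = con.execute(
    "SELECT region_id, wx_element, value FROM forecastelement").fetchall()
print(remaining)  # [('ON', 'TEMP', '12'), ('QC', 'WIND', '30')]
```

On a 470-million-row table the subquery's GROUP BY is exactly the expensive part that spilled into pgsql_tmp, which is why the thread turns to alternatives below.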
It's more of an SQL question though.
To deduplicate on the basis of
version, origin, timezone, region_id, wx_element, value, flag, units,
valid_time, issue_time, next_forecast
you could do this:

begin work;
create table temp_table as select distinct on
  (version, origin, timezone, region_id, wx_element, value, flag, units,
   valid_time, issue_time, next_forecast) * from forecastelement ;
truncate table forecastelement ;
drop index <index on forecastelement> ;
insert into forecastelement select * from temp_table ;
commit;
recreate the indexes
analyze forecastelement ;

Note that DISTINCT ON keeps only one row out of all the rows having the
same values in the specified columns. Kindly go through the DISTINCT ON
section of the manual before trying the queries.

Regds
mallah.
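The rebuild recipe above (copy one row per group into a temp table, empty the original, reinsert) can be sketched the same way. SQLite has no DISTINCT ON, so a GROUP BY with bare columns (SQLite returns one arbitrary row per group) stands in for it; the columns and data are again invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE forecastelement (region_id TEXT, wx_element TEXT, "
            "value TEXT, reception_time TEXT)")
con.executemany("INSERT INTO forecastelement VALUES (?,?,?,?)", [
    ("ON", "TEMP", "12", "2004-03-15 00:00"),
    ("ON", "TEMP", "12", "2004-03-15 00:05"),  # duplicate
    ("QC", "WIND", "30", "2004-03-16 00:00"),
])

# Rebuild instead of delete: copy one row per key group into a temp table,
# empty the original, and reinsert the deduplicated rows.
con.executescript("""
    CREATE TEMP TABLE dedup AS
        SELECT region_id, wx_element, value, reception_time
        FROM forecastelement
        GROUP BY region_id, wx_element, value;
    DELETE FROM forecastelement;
    INSERT INTO forecastelement SELECT * FROM dedup;
""")
count = con.execute("SELECT count(*) FROM forecastelement").fetchone()[0]
print(count)  # 2
```

The trade-off is the one Dan hits in his reply at the top of the thread: the temp table needs enough free disk space to hold a full deduplicated copy of the data.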
>Are there any suggestions for a better way to do this, or for using
>multiple queries to delete selectively, a week at a time, based on the
>reception_time? I would say there are a lot of duplicate entries between
>mid-March and the first week of April.
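The week-at-a-time idea can also be sketched: restrict each DELETE to one reception_time window so every statement touches a bounded slice of the table, committing between chunks. A minimal runnable sketch under the same invented toy schema as above:

```python
import sqlite3
from datetime import date, timedelta

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE forecastelement (region_id TEXT, wx_element TEXT, "
            "value TEXT, reception_time TEXT)")
con.executemany("INSERT INTO forecastelement VALUES (?,?,?,?)", [
    ("ON", "TEMP", "12", "2004-03-15 00:00"),
    ("ON", "TEMP", "12", "2004-03-15 00:05"),  # duplicate in the first week
    ("ON", "TEMP", "12", "2004-03-22 00:00"),  # duplicate in a later week
    ("QC", "WIND", "30", "2004-03-16 00:00"),
])

# Delete duplicates one week at a time; a row survives only if it holds the
# lowest rowid for its key group anywhere in the table.
start, end = date(2004, 3, 15), date(2004, 4, 12)
week = start
while week < end:
    nxt = week + timedelta(days=7)
    con.execute("""
        DELETE FROM forecastelement
        WHERE reception_time >= ? AND reception_time < ?
          AND rowid NOT IN (SELECT min(rowid) FROM forecastelement
                            GROUP BY region_id, wx_element, value)
    """, (week.isoformat(), nxt.isoformat()))
    con.commit()
    week = nxt

count = con.execute("SELECT count(*) FROM forecastelement").fetchone()[0]
print(count)  # 2
```

Note the caveat this sketch inherits from the real query: the NOT IN subquery still scans the whole table per chunk, so in PostgreSQL each weekly pass would pay the full GROUP BY cost even though the DELETE itself is small.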