Best way to delete big amount of records from big table

From: Ekaterina Amez <ekaterina(dot)amez(at)zunibal(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Best way to delete big amount of records from big table
Date: 2020-03-27 14:13:49
Message-ID: 6e456b07-a396-f9e3-6a22-23a96b73420c@zunibal.com
Lists: pgsql-performance

Hello list,

I'm trying to clean up a database with millions of records of
useless-but-don't-remove-just-in-case data. This database has all of its
tables in the public schema, so I've created a new schema "old_data" and
I'm moving all of this data there. I have several tables with 20 million
records or so that I've managed to clean up relatively quickly without
special effort (no need to drop indexes or constraints). What I did with
these tables was easy, as they are going to be emptied entirely (I only
have to keep the table definitions), so all I have to do is insert the
data into old_data.new_table and then truncate cascade the original.
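
In outline, what I do for those tables is something like this
(some_table stands in for each real table; the names here are
illustrative, not the real ones):

-- old_data.some_table is created beforehand with the same structure,
-- e.g. CREATE TABLE old_data.some_table (LIKE public.some_table);
INSERT INTO old_data.some_table SELECT * FROM public.some_table;
TRUNCATE TABLE public.some_table CASCADE;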

But I'm also cleaning tables with 150 million records where I'm going to
remove 60% of the existing data, and after a few tests I'm not sure what
the best approach is, as all of them seem to take a similar time to run.
These tables are grouped in sets of four with a master, detail,
master_history, detail_history structure. None of the tables have a
primary key, a foreign key, or any other constraint; the only related
object is the sequence used for what should be the PK column, though
that column is not actually defined as a PK.

I've decided to delete from the last one in chunks (10 days of data per
chunk, but it could be any other quantity), so I've created a function.
I've tested it with indexes (on master_hist for filtering data and on
detail_hist for the FK and PK columns), without indexes, and after
analyzing the table, and no matter what I always end up with more or
less the same execution time. I can afford the time it's currently
taking to run, but I'd like to know if there is a better way to do this.
I'm testing on version 9.2, BUT the production server is 8.4 (legacy
application; it was supposed to be on at least 9.2, but we recently
discovered it was 8.4; an upgrade is planned, but not for now). Config
parameters are the default ones.

Table definition:

CREATE TABLE master (
  id serial NOT NULL,
  device_id int4 NOT NULL,
  col1 int4 NULL DEFAULT 0,
  data_date bpchar(17) NULL, -- field to filter data
  data_file_date bpchar(14) NULL
); -- 9 of 20 records to be removed

CREATE TABLE detail (
  id serial NOT NULL,
  parent_id int4 NOT NULL,
  col1 float8 NULL,
  col2 int4 NOT NULL
); -- 2304 of 5120 records to be removed

CREATE TABLE master_history (
  id serial NOT NULL,
  device_id int4 NOT NULL,
  col1 int4 NULL DEFAULT 0,
  data_date bpchar(17) NULL, -- field to filter data
  data_file_date bpchar(14) NULL
); -- 355687 of 586999 records to be removed

CREATE TABLE detail_history (
  id serial NOT NULL,
  parent_id int4 NOT NULL,
  col1 float8 NULL,
  col2 int4 NOT NULL
); -- 91055872 of 150271744 records to be removed
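
The indexes I mention testing with were along these lines (the index
names here are illustrative, not the real ones):

-- master_history: index on the filter column; detail_history: indexes
-- on the would-be FK (parent_id) and PK (id) columns.
CREATE INDEX master_history_data_date_idx ON master_history (data_date);
CREATE INDEX detail_history_parent_id_idx ON detail_history (parent_id);
CREATE INDEX detail_history_id_idx ON detail_history (id);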

And the function:

CREATE OR REPLACE FUNCTION delete_test() RETURNS integer AS $$
DECLARE
    _begin_date date;
    _end_date date := '2019-08-01';
    _begin_exec timestamp := clock_timestamp();
    _end_exec timestamp;
    _begin_exec_partial timestamp;
    _end_exec_partial timestamp;
    _time double precision;
    _num_regs integer;
BEGIN
    for _begin_date in (select '2018-05-01'::date + s.a * '10 days'::interval
                          from (select generate_series(0, 1000) as a) as s)
    loop
        if (_begin_date > _end_date) then
            raise log 'STOP!!!!!';
            exit;
        end if;
        raise log 'Date %', _begin_date;
        _begin_exec_partial := clock_timestamp();
        delete from public.detail_history t1
          where exists
            (select 1 from public.master_history t2
              where t2.id = t1.parent_id
                and t2.data_date >= rpad(to_char(_begin_date, 'YYYYMMDD'), 17, '0')
                and t2.data_date < rpad(to_char((_begin_date + interval '10 days'), 'YYYYMMDD'), 17, '0'));
        GET DIAGNOSTICS _num_regs = ROW_COUNT;
        _end_exec_partial := clock_timestamp();
        _time := 1000 * ( extract(epoch from _end_exec_partial) - extract(epoch from _begin_exec_partial) );
        raise log 'Records removed % in % ms', _num_regs, _time;

    end loop;

    _end_exec := clock_timestamp();
    _time := 1000 * ( extract(epoch from _end_exec) - extract(epoch from _begin_exec) );
    raise log 'Total time: %', _time;
    return 0;
END;
$$ LANGUAGE plpgsql;
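
To run it, and have the RAISE LOG output reach the client session:

-- LOG-level messages don't reach the client by default.
SET client_min_messages TO log;
SELECT delete_test();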

The delete's execution plan on 8.4 is:

test_eka=# explain delete from public.detail_hist t1
test_eka-#   where exists
test_eka-#     (select 1 from public.master_hist t2
test_eka(#       where t2.id = t1.parent_id
test_eka(#         and t2.data_date >= '20180501000000000000000'
test_eka(#         and t2.data_date < '20190101000000000000000');
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=33431.46..5890182.88 rows=156649104 width=6)
   Hash Cond: (t1.parent_id = t2.id)
   ->  Seq Scan on detail_hist t1  (cost=0.00..2564256.04 rows=156649104 width=10)
   ->  Hash  (cost=30922.13..30922.13 rows=152906 width=4)
         ->  Unique  (cost=30157.60..30922.13 rows=152906 width=4)
               ->  Sort  (cost=30157.60..30539.87 rows=152906 width=4)
                     Sort Key: t2.id
                     ->  Seq Scan on master_hist t2  (cost=0.00..14897.65 rows=152906 width=4)
                           Filter: ((data_date >= '20180501000000000000000'::bpchar) AND (data_date < '20190101000000000000000'::bpchar))

After creating a PK and FK (with an index over the FK; a sketch of that
DDL follows the plan), it becomes the following. Note that this plan is
from the real tables, so the identifiers differ from the anonymized
definitions above:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=26678.41..5883424.77 rows=156648960 width=6)
   Hash Cond: (t1.id_param_espec_este = t2.id_param_espec_este_historico)
   ->  Seq Scan on param_espec_este_datos_historico_tbl t1  (cost=0.00..2564254.60 rows=156648960 width=10)
   ->  Hash  (cost=24169.09..24169.09 rows=152906 width=4)
         ->  Unique  (cost=23404.56..24169.09 rows=152906 width=4)
               ->  Sort  (cost=23404.56..23786.82 rows=152906 width=4)
                     Sort Key: t2.id_param_espec_este_historico
                     ->  Index Scan using fecha_gps_pe_este_hist_idx on param_espec_este_historico_tbl t2  (cost=0.00..8144.60 rows=152906 width=4)
                           Index Cond: ((fecha_gps_parametros_espectrales >= '20180501000000000000000'::bpchar) AND (fecha_gps_parametros_espectrales < '20190101000000000000000'::bpchar))
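
For that second test, the constraints were along these lines (again
using the anonymized names; the index over the FK column is the
parent_id one sketched above):

-- Promote the would-be PK, then add the FK that references it.
ALTER TABLE master_history
  ADD CONSTRAINT master_history_pk PRIMARY KEY (id);
ALTER TABLE detail_history
  ADD CONSTRAINT detail_history_parent_fk
  FOREIGN KEY (parent_id) REFERENCES master_history (id);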

Any ideas are welcome.

Kind regards,

Ekaterina.
