Re: Deleting Rows From Large Tables

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Rob Emery <re-pgsql(at)codeweavers(dot)net>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Deleting Rows From Large Tables
Date: 2013-05-19 20:36:50
Message-ID: CAMkU=1y1+nTdK0QBgxhV4ZQAXn8Gqe2XgdYESbt9nSfXC2qO1g@mail.gmail.com
Lists: pgsql-performance

On Fri, May 17, 2013 at 4:26 AM, Rob Emery <re-pgsql(at)codeweavers(dot)net> wrote:

> Hi All,
>
> We've got 3 quite large tables that due to an unexpected surge in
> usage (!) have grown to about 10GB each, with 72, 32 and 31 million
> rows in. I've been tasked with cleaning out about half of them, the
> problem I've got is that even deleting the first 1,000,000 rows seems
> to take an unreasonable amount of time. Unfortunately this is on quite
> an old server (Dell 2950 with a RAID-10 over 6 disks) running Postgres
> 8.4; which serves other things like our logging systems.
>

How many cores do you have? I think the Dell 2950 could have anywhere from
1 to 8.

Pick a smaller number of rows to delete, and run it with "explain analyze"
to see what is going on. I would say to use "explain (analyze,
buffers)" with track_io_timing on, but those don't exist back in 8.4.

Perhaps this would be a good excuse to upgrade!
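
As a minimal sketch on 8.4, using the table and column names from the SQL
you quoted below (the LIMIT and the BEGIN/ROLLBACK are mine, just to keep
the test small and repeatable; remember that EXPLAIN ANALYZE actually
executes the delete, hence the ROLLBACK):

BEGIN;
EXPLAIN ANALYZE
DELETE FROM table_a
WHERE id IN (SELECT id FROM table_a
             WHERE purchased = '-infinity'
               AND created_at < '2007-01-01T00:00:00'
             LIMIT 1000);
ROLLBACK;  -- discard the test delete so it can be re-run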

> If I run a sustained (more than about 5 minutes) delete it'll have a
> detrimental effect on the other services.

Do you know why? Can you identify the affected queries from those other
services and run explain analyze on them?
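
Something like this sketch against 8.4's pg_stat_activity (where the pid
column is still procpid and the query text column is current_query) would
at least show what is running or waiting while a delete batch is active:

SELECT procpid, waiting, current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>';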

> I'm trying to batch up the
> deletes into small chunks of approximately 1 month of data ; even this
> seems to take too long, I originally reduced this down to a single
> day's data and had the same problem. I can keep decreasing the size of
> the window I'm deleting but I feel I must be doing something either
> fundamentally wrong or over-complicating this enormously.

If your server is sized only to do its typical workload, then any
substantial extra workload is going to cause problems. Trying to delete 1
day's worth of data in a few seconds still seems like it is very likely
excessive. Why not jump all the way down to 5 minutes' worth, or limit it
to a certain number of rows from table_a, say 100 per batch? If you start
large and work your way down, you will often be working in the dark,
because you won't have the patience to let the large ones run to
completion and they will slow down the whole system in the meantime. If
you start at the bottom and work up, you will always know where you are,
because the previous batch ran to completion and you have the timings
from it.
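
As a sketch of that bottom end (the batch size of 100 is only a starting
point, and this ignores the xref bookkeeping from your script for a
moment), each batch would be its own short transaction:

BEGIN;
DELETE FROM table_a
WHERE id IN (SELECT id FROM table_a
             WHERE purchased = '-infinity'
               AND created_at < '2007-01-01T00:00:00'
             LIMIT 100);
COMMIT;

Repeat until it reports DELETE 0, raising the LIMIT as the timings allow.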

How fast do you need to clean this up? If it took months to get into the
situation, can't you take a few weeks to get out of it?

> I've
> switched over to retrieving a list of IDs to delete, storing them in
> temporary tables and deleting based on the primary keys on each of the
> tables with something similar to this:
>
> BEGIN TRANSACTION;
>
> CREATE TEMPORARY TABLE table_a_ids_to_delete (id INT);
> CREATE TEMPORARY TABLE table_b_ids_to_delete (id INT);
>
> INSERT INTO table_a_ids_to_delete
> SELECT id FROM table_a WHERE purchased ='-infinity' AND created_at
> < '2007-01-01T00:00:00';
>

I'd probably add a "LIMIT 100" in there. Then you can set created_at to
the final time point desired, rather than trying to increment it each time
and deciding how much to increment.
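
That is, hypothetically (only the LIMIT and the comment are mine):

INSERT INTO table_a_ids_to_delete
SELECT id FROM table_a
WHERE purchased = '-infinity'
  AND created_at < '2007-01-01T00:00:00'  -- the final cutoff, left fixed
LIMIT 100;

When a batch selects zero rows you are done, with no guessing about how
far to advance the window each time.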

>
> INSERT INTO table_b_ids_to_delete
> SELECT table_b_id FROM table_a_table_b_xref
> INNER JOIN table_a_ids_to_delete ON (table_a_ids_to_delete.id =
> table_a_table_b_xref.table_a_id);
>

Do these two queries slow down other operations? Or is it just the deletes?

>
> DELETE FROM table_a_table_b_xref USING table_a_ids_to_delete
> WHERE table_a_table_b_xref.table_a_id = table_a_ids_to_delete.id;
>
> DELETE FROM table_b USING table_b_ids_to_delete
> WHERE table_b.id = table_b_ids_to_delete.id;
>
> DELETE FROM table_a USING table_a_ids_to_delete
> WHERE table_a.id = table_a_ids_to_delete.id;
>
> COMMIT;
>

How much time do the 3 deletes take, relative to each other and to the
inserts?
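
psql's \timing toggle is an easy way to measure that; once it is on, psql
prints a "Time: ... ms" line after every statement:

\timing
-- now run the BEGIN / INSERT / DELETE / COMMIT sequence one statement
-- at a time and note the Time: line after each one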

Cheers,

Jeff
