Re: Fast way to delete big table?

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Haiming Zhang <Haiming(dot)Zhang(at)redflex(dot)com(dot)au>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Fast way to delete big table?
Date: 2016-05-16 17:42:03
Message-ID: CAMkU=1x48Kx=ztB021GNAodRR9_ZwfMGNhGJC6RW_b-z_ihKnQ@mail.gmail.com
Lists: pgsql-general

On Sun, May 15, 2016 at 10:09 PM, Haiming Zhang
<Haiming(dot)Zhang(at)redflex(dot)com(dot)au> wrote:

> Hi All,
>
>
>
> I have a big table (200G with 728 million records), and the table slows
> down lots of things. It's time to clean the data up. The requirement is
> that when I delete, I must summarise a few columns to a new table for
> backup purposes. That means I need to go through each row and add the
> columns' values to the summary table (under the corresponding category).
>

Is the performance slow-down an emergency? If not, then delete and
summarize the rows in bite-size chunks, over weeks or months. It took a
long time to get into this situation; spend some time calmly getting out
of it.

>
>
> The table had indexes (multicolumn indexes) before; I removed the indexes
> because I read a forum post saying that indexes *heavily* reduce the speed
> of deleting.
>

That generally isn't true in PostgreSQL. Deletes do not directly perform
any index maintenance (triggers might cause them to do some indirectly).
Doing the deletes does build up "vacuum debt" which must eventually be
paid, and having more indexes makes that more expensive. But that happens
in the background.
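
For example, after a large batch of deletes you can pay some of that
vacuum debt proactively rather than waiting for autovacuum. A minimal
sketch, where "big_table" is just a placeholder name:

    -- reclaim space from dead rows and refresh planner statistics
    -- after a big batch of deletes
    VACUUM ANALYZE big_table;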

> That's true if I only delete from the table, but in my case I first SELECT
> a row, then add a few values to the summary table, and then DELETE the row
> by its multi-column primary key. This seems to be a bad idea now, as it
> takes a very long time to DELETE the row (as the delete needs to find the
> row first).
>

This doesn't match your examples, as they have no WHERE clauses on the
main-table SELECT, so no index could be used to find the rows.

Your examples also have no selection criteria or stopping criteria (other
than when your outer driving script stops calling the function). That
means you might be deleting arbitrary rows in the master table (not just
the oldest ones) and might proceed with these deletions until the table
is empty. Given that, why not just summarize the entire table and then
truncate the whole thing?

...
>

You could try to rewrite this into a single SQL query, using a CTE of the
form "WITH t AS (DELETE ... RETURNING *)". The hardest part is that
PostgreSQL alas doesn't allow LIMIT on DELETE statements, so you have to
go through some ugly contortions with a sub-select, or put a WHERE clause
on your delete that uses, for example, a date to do the limiting, and have
your script increment the date on each invocation. I've not done this
with summarization, but I have done it with partitioning, where I had to
move tuples from the parent table to their new partition (or the reverse,
moving from an ill-conceived partition back to the parent) with a minimum
of disruption.
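
A sketch of the date-based variant (the table, column, and cutoff values
are invented; the summary logic is only illustrative):

    WITH t AS (
        DELETE FROM big_table
        WHERE event_date < '2015-01-01'  -- your script advances this cutoff
        RETURNING *
    )
    INSERT INTO summary_table (category, total_value, row_count)
    SELECT category, sum(some_value), count(*)
    FROM t
    GROUP BY category;

    -- the sub-select contortion instead, if you need a hard per-batch
    -- row limit on a composite primary key (pk1, pk2):
    --   WHERE (pk1, pk2) IN (SELECT pk1, pk2 FROM big_table LIMIT 10000)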

It should be faster, but I don't know how much. It would depend a lot on
whether your bottleneck is CPU or IO, amongst other things.

>
>
>
>
> Method 2 is a bit faster, but not by much: the delete speed is 2478 rows/s
> for method 2 and 2008 rows/s for method 1.
>
> Any suggestions are welcome.
>

Why is this not fast enough? So, it would take 4 days. How fast do you
need it to be?
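
(Back of the envelope: 728 million rows at 2008-2478 rows/s is roughly
294,000 to 363,000 seconds, or about 3.4 to 4.2 days.)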

But there is the possibility that it will slow down over time, as each
execution might have to scan over the debris of all of the previously
deleted rows before it gets to the ones you want to work on.

Cheers,

Jeff
