Moving data from huge table slow, min() query on indexed column taking 38s

From: Dimitrios Apostolou <jimis(at)gmx(dot)net>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Moving data from huge table slow, min() query on indexed column taking 38s
Date: 2023-07-06 15:00:19
Message-ID: 096aa4b3-892e-7fa5-7370-3a1895e7d6a0@gmx.net
Lists: pgsql-general

I have a huge (10G rows) table named test_runs (referred to as
"tbl_legacy" below) that I want to PARTITION BY RANGE on an indexed
integer column.

I have created the new partitioned table "tbl" with 1000 partitions,
so that it can take my existing data and still have partitions ready
for the next year of data (most of the new partitions will be empty
for now).
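For concreteness, the setup looks roughly like the sketch below; the
partition names and the 10000-wide key ranges are placeholders, not my
real values:

    CREATE TABLE tbl (LIKE tbl_legacy INCLUDING DEFAULTS INCLUDING CONSTRAINTS)
        PARTITION BY RANGE (partition_key_column);

    -- Generate the 1000 partitions in a loop.
    DO $$
    BEGIN
        FOR i IN 0..999 LOOP
            EXECUTE format(
                'CREATE TABLE tbl_p%s PARTITION OF tbl
                     FOR VALUES FROM (%s) TO (%s)',
                i, i * 10000, (i + 1) * 10000);
        END LOOP;
    END $$;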

On the topic of bulk-transferring the data from tbl_legacy to tbl:

+ First I tried to DELETE ... INSERT everything in a single statement.

I noticed that the postgres process was growing without bounds, to
the point that it had consumed almost all of the swap space and I had
to kill the command. Any idea why that happens?
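The statement was essentially a single unqualified move of all rows,
roughly:

    WITH rows AS (
        DELETE FROM tbl_legacy AS t
        RETURNING t.*
    )
    INSERT INTO tbl SELECT * FROM rows;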

+ Then I broke the command above into many small chunks:

    WITH rows AS (
        DELETE FROM tbl_legacy AS t
        WHERE t.partition_key_column < $1
        RETURNING t.*
    )
    INSERT INTO tbl SELECT * FROM rows;
    COMMIT;
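A PL/pgSQL sketch of the driving loop (the starting cutoff and the
chunk size are placeholders; COMMIT inside a DO block requires
PostgreSQL 11+ and running it outside an explicit transaction):

    DO $$
    DECLARE
        cutoff  integer := 10000;    -- placeholder first boundary
        step    integer := 10000;    -- placeholder chunk size
        max_key integer;
    BEGIN
        SELECT max(partition_key_column) INTO max_key FROM tbl_legacy;
        WHILE cutoff <= max_key + step LOOP
            WITH rows AS (
                DELETE FROM tbl_legacy AS t
                WHERE t.partition_key_column < cutoff
                RETURNING t.*
            )
            INSERT INTO tbl SELECT * FROM rows;
            COMMIT;   -- each chunk commits in its own transaction
            cutoff := cutoff + step;
        END LOOP;
    END $$;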

I increase the parameter $1 on each iteration and keep going in the
loop. At first this goes OK, but after one day I notice that it has
slowed down significantly. My investigation shows that the simple query

    SELECT min(partition_key_column) FROM tbl_legacy;

takes 38s, despite the index on that column! A VACUUM fixes that, so
I suspect the index contains a ton of dead tuples. I also suspect that
autovacuum is not doing its job because the table is constantly busy.
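The dead-tuple buildup should be visible in the cumulative statistics
views, for example:

    SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'tbl_legacy';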

Unfortunately VACUUM takes a long time (several hours) on this huge
table, so I can't add it to the loop after each DELETE command.

Is there a better way to avoid the bloat in the loop above? Why can't
the DELETE command update the index, so that the minimum element points
beyond the dead tuples?

+ Finally I resorted to just copying the data:

    INSERT INTO tbl SELECT * FROM tbl_legacy;

This took half a day or so, but worked without blowing up the memory
usage like the first method did. Why is that?
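The cutover afterwards can then be a quick rename, something like this
untested sketch:

    BEGIN;
    ALTER TABLE tbl_legacy RENAME TO tbl_old;
    ALTER TABLE tbl RENAME TO tbl_legacy;  -- new table takes the old name
    COMMIT;
    ANALYZE tbl_legacy;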

By the way, I don't consider the last solution ideal, since the data
is temporarily duplicated between the two tables, and it's harder to
work around that without taking downtime. But it worked for me this
time. Any other ideas for partitioning an existing huge table?

Thanks in advance,
Dimitris
