Re: Thoughts on how to avoid a massive integer update.

From: "Fehrle, Brian" <bfehrle(at)comscore(dot)com>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Thoughts on how to avoid a massive integer update.
Date: 2020-05-08 16:57:49
Message-ID: 0F5992D4-9317-4007-9036-202DC89BA0D8@comscore.com
Lists: pgsql-general

On 5/4/20, 3:56 PM, "Adrian Klaver" <adrian(dot)klaver(at)aklaver(dot)com> wrote:

[External Email]


On 5/4/20 2:32 PM, Fehrle, Brian wrote:
> Hi all,
>
> This is a shot in the dark in hopes to find a magic bullet to fix an
> issue I have, I can’t personally think of any solution myself.
>
> I have a database with hundreds of terabytes of data, where every table
> has an integer column referencing a small table. For reasons out of my
> control and cannot change, I NEED to update every single row in all
> these tables, changing the integer value to a different integer.
>
> Since I have to deal with dead space, I can only do a couple tables at a
> time, then do a vacuum full after each one.

Why?
A regular vacuum would mark the space as available.

A regular vacuum would mark the space as available ***for re-use***; it will not release the space back to the drive as ‘unused’. 99% of my tables hold old data that will not receive any future inserts or updates, which means space marked ready for re-use will never actually be used.
This means that when a 100GB table is updated, every row is marked dead and re-created with the newly updated data, leaving the table 200GB in size. A plain vacuum keeps it at 200GB of space used, freeing the 100GB of dead space as ready-to-reuse. A vacuum full makes it 100GB again.

Since 99% of my tables will never be updated or inserted into again, my ~300 terabytes of data would become ~600 terabytes on disk. Thus, vacuum full.
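
For reference, a single per-table pass looks roughly like this (just a sketch; data_table, info_table_id, and the ~200-row mapping table value_map are stand-ins for illustration):

    UPDATE data_table d
       SET info_table_id = m.new_sid
      FROM value_map m                 -- hypothetical old-value -> new-value mapping
     WHERE d.info_table_id = m.old_sid;

    -- Rewrites the table and returns the dead space to the OS, but takes an
    -- ACCESS EXCLUSIVE lock for the duration, hence only a couple of tables at a time.
    VACUUM FULL data_table;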

More below.

> Another option is to build a new table with the new values, then drop
> the old one and swap in the new, either way is very time consuming.
>
> Initial tests suggest this effort will take several months to complete,
> not to mention cause blocking issues on tables being worked on.
>
> Does anyone have any hackery ideas on how to achieve this in less time?
> I was looking at possibly converting the integer column type to another
> that would present the integer differently, like a hex value, but
> everything still ends up requiring all data to be re-written to disk. In
> a well designed database (I didn’t design it :) ), I would simply change
> the data in the referenced table (200 total rows), however the key being
> referenced isn’t just an arbitrary ID, it’s actual ‘data’, and must be
> changed.

I'm not following above.

Could you show an example table relationship?

It’s a simple one-to-many relationship:
    CREATE TABLE info_table (
        info_table_sid integer PRIMARY KEY   -- must be unique to be referenced
    );

    CREATE TABLE data_table (
        data_table_sid integer,
        info_table_id integer REFERENCES info_table(info_table_sid)
    );
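
If it helps to picture the "build a new table and swap it in" route mentioned above, it would look something like this against that schema (again only a sketch; value_map is a hypothetical old-to-new mapping table, and the swap still needs an exclusive lock window):

    CREATE TABLE data_table_new AS
    SELECT d.data_table_sid, m.new_sid AS info_table_id
      FROM data_table d
      JOIN value_map m ON m.old_sid = d.info_table_id;

    -- CREATE TABLE ... AS copies no constraints or indexes, so they must be re-added.
    ALTER TABLE data_table_new
      ADD FOREIGN KEY (info_table_id) REFERENCES info_table(info_table_sid);

    BEGIN;
    DROP TABLE data_table;
    ALTER TABLE data_table_new RENAME TO data_table;
    COMMIT;

Either way the full row data gets rewritten, which is why both approaches end up being very time consuming.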

>
> Thanks for any thoughts or ideas,
>
> * Brian F
>


--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
