RE: Updating large tables without dead tuples

From: "ldh(at)laurent-hasson(dot)com" <ldh(at)laurent-hasson(dot)com>
To: Vik Fearing <vik(dot)fearing(at)2ndquadrant(dot)com>, "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>
Subject: RE: Updating large tables without dead tuples
Date: 2018-03-10 23:42:37
Message-ID: BY2PR15MB0872ED7F4F8158F6943A7DED85DD0@BY2PR15MB0872.namprd15.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

> -----Original Message-----
> From: Vik Fearing [mailto:vik(dot)fearing(at)2ndquadrant(dot)com]
> Sent: Friday, March 02, 2018 20:56
> To: ldh(at)laurent-hasson(dot)com; pgsql-performance(at)lists(dot)postgresql(dot)org
> Cc: Stephen Frost <sfrost(at)snowman(dot)net>
> Subject: Re: Updating large tables without dead tuples
>
> On 02/24/2018 12:27 AM, ldh(at)laurent-hasson(dot)com wrote:
> > Hello
> >
> >
> >
> > I work with a large and wide table (about 300 million rows, about 50
> > columns), and from time to time, we get business requirements to make
> > some modifications. But sometimes, it's just some plain mistake. This
> > has happened to us a few weeks ago where someone made a mistake and we
> > had to update a single column of a large and wide table. Literally,
> > the source data screwed up a zip code and we had to patch on our end.
> >
> >
> >
> > Anyways. Query ran was:
> >
> >     update T set source_id = substr(sourceId, 2, 10);
> >
> > Took about 10h and created 100's of millions of dead tuples, causing
> > another couple of hours of vacuum.
> >
> >
> >
> > This was done during a maintenance window, and that table is read-only
> > except when we ETL data to it on a weekly basis, and so I was just
> > wondering why I should pay the "bloat" penalty for this type of
> > transaction. Is there a trick that could be use here?
> Yes, there is a trick I like to use here, as long as you don't mind locking the
> table (even against reads).
>
> I'll assume T.source_id is of type text. If it's not, use whatever the actual type
> is.
>
> ALTER TABLE T
> ALTER COLUMN source_id TYPE text USING substr(sourceId, 2, 10);
>
> I copied what you had verbatim, I earnestly hope you don't have two columns
> source_id and sourceId in your table.
>
> This will rewrite the entire table just the same as a VACUUM FULL after your
> UPDATE would.
>
> Don't forget to VACUUM ANALYZE this table after the operation. Even though
> there will be no dead rows, you still need to VACUUM it to generate the
> visibility map and you need to ANALYZE it for statistics on your "new" column.
>
> Foreign keys remain intact with this solution and you don't have double wal
> logging like for an UPDATE.
> --
> Vik Fearing +33 6 46 75 15 36
> http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

[Laurent Hasson]
Yes, sorry... only a single column source_id. I understand your idea... Is that because a TEXT field (vs a varchar) would be considered TOAST and be treated differently?

Thanks,
Laurent.

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message dangal 2018-03-11 12:48:41 Memory size
Previous Message Pavel Stehule 2018-03-08 18:00:36 Re: [HACKERS] proposal: schema variables