Re: Massive table (500M rows) update nightmare

From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Carlo Stonebanks <stonec(dot)register(at)sympatico(dot)ca>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Massive table (500M rows) update nightmare
Date: 2010-01-08 18:08:36
Message-ID: 20100108180836.GJ1895@it.is.rice.edu
Lists: pgsql-performance

On Fri, Jan 08, 2010 at 12:38:46PM -0500, Carlo Stonebanks wrote:
>> I thought that post mentioned that the plan
>> was one statement in an iteration, and that the cache would have
>> been primed by a previous query checking whether there were any rows
>> to update. If that was the case, it might be worthwhile to look at
>> the entire flow of an iteration.
>
> This is the only SQL query in the code in question - the rest of the code
> manages the looping and commit. The code was copied to PgAdminIII and
> values written in for the WHERE clause. In order for me to validate that
> rows would have been updated, I had to run a SELECT with the same WHERE
> clause in PgAdminIII first to see how many rows would have qualified. But
> this was for testing purposes only. The SELECT statement does not exist in
> the code. The vast majority of the rows that will be processed will be
> updated as this is a backfill to synch the old rows with the values being
> filled into new columns now being inserted.
>
>> Also, if you ever responded with version and configuration
>> information, I missed it.
>
> This is hosted on a new server the client set up so I am waiting for the
> exact OS and hardware config. PG Version is PostgreSQL 8.3.6, compiled by
> Visual C++ build 1400, OS appears to be Windows 2003 x64 Server.
>
> More than anything, I am more concerned with the long-term use of the
> system. This particular challenge with the 500M row update is one thing,
> but I am concerned about the exceptional effort required to do this. Is it
> REALLY this exceptional to want to update 500M rows of data in this day and
> age? Or is the fact that we are considering dumping and restoring and
> dropping indexes, etc. to do all of this an early warning that we don't
> have a solution that is scaled to the problem?
>
> Config data follows (I am assuming commented values which I did not include
> are defaulted).
>
> Carlo
>

Hi Carlo,

It all boils down to resource management and constraints. For small
problems relative to the amount of system resources available, little
effort is needed to achieve satisfactory performance. As the problems
consume more and more of the total resource capacity, you need
increasingly deep knowledge of how every piece of the system works
to wring the most possible performance out of it.
Some of the discussions on this topic have covered a smattering of
details that will become increasingly important as your system scales
and determine whether or not it will scale. Many times the need for
updates on such a massive scale does point to normalization problems.

My two cents.

Cheers,
Ken
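[Editor's note: a minimal Python sketch of the batched-update loop Carlo describes, i.e. one UPDATE statement per iteration with a commit after each chunk. The table name, column names, and single integer primary key are hypothetical, not taken from the thread.]

```python
# Sketch of a key-range batching helper for a massive backfill UPDATE.
# Processing the table in bounded chunks, with a commit per chunk, keeps
# each transaction's lock footprint and WAL volume small and lets VACUUM
# reclaim dead tuples between batches.

def key_ranges(min_id, max_id, batch_size):
    """Yield inclusive (lo, hi) primary-key ranges covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield (lo, hi)
        lo = hi + 1

# Each range would drive one statement, e.g. (names are hypothetical):
#
#   UPDATE big_table
#      SET new_col = old_col
#    WHERE id BETWEEN %s AND %s;
#
# followed by COMMIT, so no single transaction touches all 500M rows.

print(list(key_ranges(1, 10, 4)))  # [(1, 4), (5, 8), (9, 10)]
```

The range scan replaces the separate SELECT-then-UPDATE check mentioned above: since nearly every row qualifies for the backfill, probing first costs more than it saves.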
