From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Pgsql-performance(at)postgresql(dot)org |
Subject: | Performance on large data transformations |
Date: | 2003-03-17 17:38:38 |
Message-ID: | 200303170938.38365.josh@agliodbs.com |
Lists: | pgsql-performance |
Folks,
On one database, I have an overnight data transformation procedure that goes
like:
TableA has about 125,000 records.
Begin Transaction:
1) Update 80% of records in TableA
2) Update 10% of records in TableA
3) Update 65% of records in TableA
4) Update 55% of records in TableA
5) Update 15% of records in TableA with references to other records in TableA
6) Flag what hasn't been updated.
Commit
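To make the shape of this concrete, here is a minimal sketch of such a procedure; the column names (`status`, `amount`, `parent_id`) and predicates are purely hypothetical, not the actual schema:

```sql
-- Hypothetical sketch: several large sequential UPDATEs on the same
-- table inside one transaction. Under MVCC, each UPDATE writes a new
-- row version, so by step 4 the table holds several dead versions of
-- most rows, and no VACUUM can run until after COMMIT.
BEGIN;

UPDATE tablea SET status = 'staged'
 WHERE source_flag = 1;                      -- touches ~80% of rows

UPDATE tablea SET amount = amount * 1.1
 WHERE category = 'adjusted';                -- ~10% of rows

UPDATE tablea SET status = 'computed'
 WHERE status = 'staged' AND amount > 0;     -- ~65% of rows

UPDATE tablea SET status = 'verified'
 WHERE status = 'computed';                  -- ~55% of rows

UPDATE tablea AS a SET parent_id = b.id      -- ~15% of rows, self-join
  FROM tablea AS b
 WHERE a.parent_key = b.natural_key;

UPDATE tablea SET status = 'skipped'
 WHERE status NOT IN ('verified', 'computed');  -- flag the remainder

COMMIT;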
I've found that, no matter what my FSM settings (I've gone as high as
1,000,000), by the time I reach step 4 execution has slowed down considerably,
and for step 5 it takes the server more than 10 minutes to complete the
update statement. During this period, CPU, RAM and disk I/O are almost idle
... the system seems to spend all of its time doing lengthy seeks. There is,
for that matter, no kernel swap activity, but I'm not sure how to measure
Postgres temp file activity.
(FYI: Dual Athlon 1600MHz/1GB/hardware RAID 1 with xlog on separate SCSI
drive/Red Hat Linux 8.0/PostgreSQL 7.2.4)
The only way around this I've found is to break up the above into separate
transactions with VACUUMs in between, and "simulate" a transaction by making
a back-up copy of the table and restoring from it if something goes wrong.
I've tried enough different methods to be reasonably certain that there is no
way around this in 7.2.4.
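The workaround described above could be sketched as follows; the backup/restore mechanics and statement bodies are illustrative assumptions, not the actual procedure:

```sql
-- Hypothetical sketch of the workaround: run each step as its own
-- (auto-committed) transaction so VACUUM can reclaim dead tuples in
-- between -- VACUUM cannot run inside a transaction block -- and keep
-- a backup copy of the table to "simulate" rollback by hand.

-- Take a backup copy before starting.
CREATE TABLE tablea_backup AS SELECT * FROM tablea;

UPDATE tablea SET status = 'staged' WHERE source_flag = 1;  -- step 1
VACUUM tablea;                         -- reclaim dead tuples from step 1

UPDATE tablea SET amount = amount * 1.1
 WHERE category = 'adjusted';                               -- step 2
VACUUM tablea;

-- ... remaining steps, each followed by a VACUUM ...

-- If any step fails, restore by hand from the backup:
-- DELETE FROM tablea;
-- INSERT INTO tablea SELECT * FROM tablea_backup;
```

The trade-off, of course, is losing real atomicity: a crash between steps leaves the table in an intermediate state until the manual restore runs.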
The reason I bring this up is that PostgreSQL's dramatic plunge in performance
in large serial updates is really problematic for us in the OLAP database
market, where large data transformations, as well as extensive use of
calculated temporary tables, is common. I was particularly distressed when
I had to tell a client considering switching from MSSQL to Postgres for an
OLAP database that they might just be trading one set of problems for
another.
Is there any way we can improve on this kind of operation in future versions
of PostgreSQL?
--
Josh Berkus
Aglio Database Solutions
San Francisco