Re: reducing IO and memory usage: sending the content of a table to multiple files

From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: reducing IO and memory usage: sending the content of a table to multiple files
Date: 2009-04-03 09:09:56
Message-ID: 20090403110956.477596a8@dawn.webthatworks.it
Lists: pgsql-general

On Fri, 3 Apr 2009 02:05:19 +0100
Sam Mason <sam(at)samason(dot)me(dot)uk> wrote:

> On Thu, Apr 02, 2009 at 09:48:33PM +0200, Ivan Sergio Borgonovo
> wrote:
> > I didn't find any elegant example of cursor use in PHP... OK, PHP
> > is not the most elegant language around... but still, any good
> > example someone could point me at?

> You can obviously increase the "FETCH" upwards and if you're
> feeling fancy you could even run the FETCH async from the code
> that processes the results. Maybe something like:

What kind of advantage would I get from running it asynchronously?
Oh, I didn't mean you were suggesting any particular advantage... just
wondering. It could be an option if, once everything is up, I want to
keep the resources this process consumes under control (?).
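For the record, the cursor loop I had in mind is a minimal sketch along
these lines (connection string, cursor name, table and column names are
all made up):

<?php
// sketch: stream rows through a cursor so only one batch of rows
// is held in memory at a time; all names below are invented
$conn = pg_connect('dbname=mydb');

pg_query($conn, 'BEGIN');
pg_query($conn,
    'DECLARE cur NO SCROLL CURSOR FOR SELECT col3, col4 FROM items');

while (true) {
    $res = pg_query($conn, 'FETCH 1000 FROM cur');
    if (pg_num_rows($res) == 0) {
        break;
    }
    while ($row = pg_fetch_assoc($res)) {
        // build the xml fragments and gzwrite() them here
    }
    pg_free_result($res);
}

pg_query($conn, 'CLOSE cur');
pg_query($conn, 'COMMIT');
?>

If I understood your suggestion, the async variant would issue the next
FETCH with pg_send_query() and collect it with pg_get_result() while the
current batch is being formatted and gzipped.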

> > So I think the largest cost of the operation will be IO.
> > \copy should be optimised for "raw" data output, but maybe all
> > its advantages get lost once I've to use pipes and adding
> > complexity to filtering.

> Streaming IO is pretty fast, I think you'll be hard pushed to keep
> up with it from PHP and you'll end up CPU bound in no time. Be
> interesting to find out though.

Filtering is currently very simple... I'm building a very simple XML
document, just queueing constant strings and whatever comes out of
the DB.
But if I had to parse CSV (split), or just assign names to columns
(and I expect this stuff to be adjusted frequently), or use
regexps... I was expecting to waste more human cycles and CPU cycles
than I'd save by relying on the optimised IO of \copy (if any).
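Just to make the comparison concrete, the extra step on \copy output
would be roughly this (column names are invented, NULLs and escaping are
not handled):

// sketch: turn one line of tab-separated \copy output into a named row
$cols   = array('col3', 'col4');               // invented column names
$fields = explode("\t", rtrim($line, "\n"));
$row    = array_combine($cols, $fields);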
Most of the operations end up being:
$output = SOMEHEADER;   // SOMEHEADER/SOMEFOOTER/SOMECONST* are placeholder constants
gzwrite($f1, $output);
gzwrite($f2, $output);
gzwrite($f3, $output);
gzwrite($f4, $output);
while ($row = pg_fetch_assoc($res)) {   // $res is the result set being iterated
    $record1 = '<sometag someattr>' . SOMECONST1 . $row['col4']
             . '</sometag><someothertag someattr>' . $row['col3']
             . '</someothertag>';
    $record2 = '<sometag someattr>' . SOMECONST2 . $row['col4']
             . '</sometag><someothertag someattr>' . $row['col3']
             . '</someothertag>';
    // which record goes to which file is just schematic here
    gzwrite($f1, $record1);
    gzwrite($f2, $record1);
    gzwrite($f3, $record2);
    gzwrite($f4, $record2);
}
$output = SOMEFOOTER;
gzwrite($f1, $output);
gzwrite($f2, $output);
gzwrite($f3, $output);
gzwrite($f4, $output);

The largest table in my DB has to be sliced into multiple XML files
that are then written to disk.
So it boils down to 1) reading the whole table and returning 30% of
its fields and 2) writing all that data out multiple times.
This is by far the largest write load the server incurs in a day.
It may well be the largest CPU load of the day too, considering I
have to gzip all the files.
Still, I think I've read on this list that compression is more likely
to be the bottleneck than IO.
I just ran a preliminary test: building the XML and gzipping 80K
records out of 1M takes less than 2 seconds.
So maybe I was over-concerned.
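If compression does turn out to be the bottleneck, I guess the first
knob to turn is the zlib level passed to gzopen() (file names are made
up):

// sketch: trade compression ratio for CPU by picking the zlib level
$f1 = gzopen('/var/tmp/feed1.xml.gz', 'wb1');  // fastest, largest output
$f2 = gzopen('/var/tmp/feed2.xml.gz', 'wb9');  // slowest, smallest output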

Anyway, I'd like to understand a bit better how IO and memory
consumption are managed with a cursor vs. a plain SELECT, and what
the client drivers in the middle do with the data.

> > I was reading about all the php documents and trying to
> > understand how buffers and memory usage works, so I gave a look
> > to MySQL documents too...
>
> Not sure about PG, but the C api pretty much always buffers
> everything in memory first. There was mention of getting control
> of this, but I've got no idea where it got.

buffer *everything*?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
