Re: reducing IO and memory usage: sending the content of a table to multiple files

From: Sam Mason <sam(at)samason(dot)me(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: reducing IO and memory usage: sending the content of a table to multiple files
Date: 2009-04-03 14:50:00
Message-ID: 20090403145000.GV12225@frubble.xen.chris-lamb.co.uk
Lists: pgsql-general

On Fri, Apr 03, 2009 at 11:09:56AM +0200, Ivan Sergio Borgonovo wrote:
> On Fri, 3 Apr 2009 02:05:19 +0100 Sam Mason <sam(at)samason(dot)me(dot)uk> wrote:
> > On Thu, Apr 02, 2009 at 09:48:33PM +0200, Ivan Sergio Borgonovo wrote:
> > > I didn't find any elegant example of cursor use in PHP... OK PHP
> > > is not the most elegant language around... but still any good
> > > example someone could point me at?
>
> > You can obviously increase the "FETCH" upwards and if you're
> > feeling fancy you could even run the FETCH async from the code
> > that processes the results. Maybe something like:
>
> What kind of advantage should I get running asynchronously?

PG will be running the next "FETCH" command while you're processing the
last one. Otherwise you're serialising everything; i.e.

1) ask for results
2) PG executes query
3) PG sends you results
4) you process results

all happen sequentially, even though they can be done in parallel. If
you're doing it async I'd expect that steps 2 and 3 will be done in the
background while your code is running step 4. I've done this sort of
thing in C before but not from PHP; there shouldn't be much difference
though. If you've got a step 5 of sending the results off somewhere
else, then you may be able to arrange for that to happen in parallel
too.
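
In PHP terms, something like the following is what I had in mind. It's
a completely untested sketch; $conn, the cursor name "cur" and
process_row() are all made up, and it assumes the pgsql extension's
async functions (pg_send_query/pg_get_result):

  // assumes a cursor called "cur" has already been DECLAREd inside a
  // transaction on the connection $conn
  pg_send_query($conn, "FETCH 1000 FROM cur");   // don't wait for it
  do {
      $res = pg_get_result($conn);  // block until that FETCH is done
      pg_get_result($conn);         // consume the end-of-results marker
      $more = pg_num_rows($res) > 0;
      if ($more) {
          // ask PG for the next block *before* touching this one, so
          // steps 2 and 3 run in the background while we do step 4
          pg_send_query($conn, "FETCH 1000 FROM cur");
          while ($row = pg_fetch_assoc($res)) {
              process_row($row);    // made-up name; your step 4 here
          }
      }
  } while ($more);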

I'd stay away from threads if at all possible; it's very easy to
introduce obscure bugs that you don't notice for a long time. It's
easy to write multi-threaded code; what's *very* difficult is writing
correct multi-threaded code that solves a non-trivial problem.

> oh I didn't mean you were suggesting any advantage... just wondering.
> It could be an option if once everything is up I want to keep under
> control resources sucked by this process (?).

It's only ever going to get one "block" (i.e. 1000 rows in the example
above) ahead of itself. The resource consumption is quite tightly
bounded.

> Filtering is currently very simple... I'm building a very simple xml
> just queueing constant strings and what comes out of the DB.
> But if I had to parse a CSV (split) or just assign names to columns
> (and I expect this stuff is going to be adjusted frequently) or use
> regexp... I was expecting to waste more human cycles or CPU cycles
> than avoiding to rely on optimised IO of \copy (if any).

OK, it was this I was wondering. I wasn't sure if you were just
gzipping some sort of CSV output or something. Once you have to start
pulling individual fields apart you're probably better off using
normal queries.
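
To be concrete, once you're naming columns the per-row loop ends up
looking something like this (untested; $res is whatever pg_query() or
a FETCH gave you, and the column names are invented):

  $xml = "";
  while ($row = pg_fetch_assoc($res)) {
      // constant strings plus whatever came out of the DB, escaped
      $xml .= "<item>"
            . "<id>"   . htmlspecialchars($row['item_id']) . "</id>"
            . "<name>" . htmlspecialchars($row['name'])    . "</name>"
            . "</item>\n";
  }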

> Still I think I've read on this list that compression was going
> to be a bottleneck more than IO.
> I just did a preliminary test and xml-ing and gzipping 80K records
> out of 1M takes less than 2sec.
> So maybe I was over concerned.

The interesting measurement is throughput in bytes per second. A single
hard drive will stream data out at around 80MB/s; gzip has been much
slower than that whenever I've tried it. There are lighter-weight
compression schemes, but I think you said you needed gzip so it's not
worth going into them much.
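
If you want a concrete number for your own setup, something like this
(untested; $chunks just stands in for your per-block output loop) will
tell you what you're actually getting, and the gzopen() compression
level is the main knob to play with:

  $fp = gzopen("items.xml.gz", "w6");  // "w6" = write, level 6
  $t0 = microtime(true);
  $bytes = 0;
  foreach ($chunks as $chunk) {
      gzwrite($fp, $chunk);            // compress straight to disk
      $bytes += strlen($chunk);
  }
  gzclose($fp);
  printf("%.1f MB/s\n", $bytes / (microtime(true) - $t0) / 1e6);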

> > Not sure about PG, but the C api pretty much always buffers
> > everything in memory first. There was mention of getting control
> > of this, but I've got no idea where it got.
>
> buffer *everything*?

As far as I know (i.e. this is how the C library works) the data behind
your "$result" is copied from PG into memory on the machine that's
running your PHP code before pg_query() returns. If you're asking for
the whole of a table with a billion rows you're probably going to run
out of memory and not get any useful work done.

When you use cursors, your PHP code is only dealing with, say, 1000
rows at a time and PG is worrying about where to keep the rest. Thus
resource consumption is under your control with cursors; otherwise
it's at the mercy of you writing sensible queries.
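
For completeness, the plain synchronous version of the cursor loop I
mean is roughly this (untested; the table name is invented):

  pg_query($conn, "BEGIN");
  pg_query($conn, "DECLARE cur NO SCROLL CURSOR FOR SELECT * FROM items");
  do {
      $res = pg_query($conn, "FETCH 1000 FROM cur");
      $n = pg_num_rows($res);
      // never more than ~1000 rows in PHP's memory; PG keeps the rest
      while ($row = pg_fetch_assoc($res)) {
          // ... build your XML from $row here ...
      }
  } while ($n > 0);
  pg_query($conn, "CLOSE cur");
  pg_query($conn, "COMMIT");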

The motivating reason for this behaviour is the handling of errors in
processing; the assumption is that your code shouldn't get a result set
back until it's known that those results are complete and consistent.
The easiest way to do that is to buffer them in memory until everything
has come back from PG. E.g. what if the last row generated a divide by
zero exception, or something caused the backend to die?

--
Sam http://samason.me.uk/
