processing large amount of rows with plpgsql

From: Geert Mak <pobox(at)verysmall(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: processing large amount of rows with plpgsql
Date: 2012-08-08 19:41:17
Message-ID: E8B87AF3-D112-43A3-9EC7-155FD377BFB7@verysmall.org
Lists: pgsql-general

hello everybody,

we are trying to move the data from table1 into table2 using a plpgsql stored procedure which performs a simple data conversion

there are about 50 million rows

the tables are relatively simple: fewer than a dozen columns; most are integer, a couple are char(32), and one is a varchar holding URLs

what happens is that when we execute the stored procedure, the execution eats up the 17 GB of free space on the server and the server crashes

if we apply a limit of 1 million rows, the execution completes successfully in about a minute

we understand, by now, that plpgsql functions are executed in their own transaction, which commits when the function comes to an end

the question is -

are plpgsql functions suitable for such massive operations on data, and can this work without so much disk space being eaten for something which should simply be "read-change-write, read-change-write, read-change-write, …"? i.e. is there any way to force a commit inside the function, or something similar?

or should we rather implement this operation in some external scripting language? (speed is not that important, this is a one-time conversion)

thank you,
geert
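
A minimal sketch of one way such a conversion can be split into separately committed chunks, driven from psql (or any client) rather than from inside a single plpgsql function; the table1/table2 names come from the post above, while the integer key "id", the "url" column, and the lower() call are placeholders for the real schema and conversion:

-- in psql's default autocommit mode each statement is its own transaction,
-- so no single transaction has to cover all 50 million rows
INSERT INTO table2 (id, url)
SELECT id, lower(url)                 -- placeholder for the real conversion
FROM table1
WHERE id > 0 AND id <= 1000000;

INSERT INTO table2 (id, url)
SELECT id, lower(url)
FROM table1
WHERE id > 1000000 AND id <= 2000000;

-- ...one statement per range, continuing up to max(id)

Each chunk commits as soon as its statement finishes, instead of everything accumulating inside one function-wide transaction.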

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Geert Mak <pobox(at)verysmall(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: processing large amount of rows with plpgsql
Date: 2012-08-08 20:04:08
Message-ID: CAHyXU0yr7n0nFB4bieYX7C9+v6CPwWmZ=1u1xQ2Pej0h6WGHeg(at)mail(dot)gmail(dot)com
Lists: pgsql-general

On Wed, Aug 8, 2012 at 2:41 PM, Geert Mak <pobox(at)verysmall(dot)org> wrote:
> hello everybody,
>
> we are trying to move the data from table1 into table2 using a plpgsql stored procedure which performs a simple data conversion
>
> there are about 50 million rows
>
> the tables are relatively simple: fewer than a dozen columns; most are integer, a couple are char(32), and one is a varchar holding URLs
>
> what happens is that when we execute the stored procedure, the execution eats up the 17 GB of free space on the server and the server crashes
>
> if we apply a limit of 1 million rows, the execution completes successfully in about a minute
>
> we understand, by now, that plpgsql functions are executed in their own transaction, which commits when the function comes to an end
>
> the question is -
>
> are plpgsql functions suitable for such massive operations on data, and can this work without so much disk space being eaten for something which should simply be "read-change-write, read-change-write, read-change-write, …"? i.e. is there any way to force a commit inside the function, or something similar?
>
> or should we rather implement this operation in some external scripting language? (speed is not that important, this is a one-time conversion)

What is the general structure of the procedure? In particular, how
are you browsing and updating the rows? There is (almost) no way to
force a commit inside a function -- there has been some discussion
about a stored procedure and/or autonomous transaction feature that
would get there.

I say 'almost' because you can emulate some aspects of autonomous
transactions with dblink, but that may not be a very good fit for your
particular case.

merlin
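
A rough sketch of the dblink workaround mentioned above: each dblink_exec() runs its batch on a separate connection, so the batch commits independently of the calling function's transaction. The table1/table2 names come from the original post; the integer key "id", the "url" column, the lower() conversion, and the connection string are assumptions that would need adjusting to the real setup:

-- requires the dblink contrib module
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE OR REPLACE FUNCTION convert_in_batches(batch_size bigint DEFAULT 1000000)
RETURNS void AS $$
DECLARE
    lo bigint := 0;
    hi bigint;
BEGIN
    SELECT max(id) INTO hi FROM table1;        -- assumes an integer key "id"
    -- connection string may need user/password options depending on the setup
    PERFORM dblink_connect('batch', 'dbname=' || current_database());
    WHILE lo < hi LOOP
        -- each dblink_exec runs, and commits, as its own transaction
        -- on the separate connection
        PERFORM dblink_exec('batch',
            'INSERT INTO table2 (id, url) '
            || 'SELECT id, lower(url) FROM table1 '   -- placeholder conversion
            || 'WHERE id > ' || lo || ' AND id <= ' || (lo + batch_size));
        lo := lo + batch_size;
    END LOOP;
    PERFORM dblink_disconnect('batch');
END;
$$ LANGUAGE plpgsql;

-- SELECT convert_in_batches();

If the calling function later aborts, the batches already pushed through dblink stay committed, which is the autonomous-transaction-like behaviour being asked about.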
