Large object loading stalls

From: Michael Akinde <michael(dot)akinde(at)met(dot)no>
To: pgsql-general(at)postgresql(dot)org
Subject: Large object loading stalls
Date: 2009-02-18 13:27:41
Message-ID: 499C0CCD.7010304@met.no
Lists: pgsql-general

Hi,

We have been running into a strange problem with the loading of large
objects on the production servers of a system we are about to roll out.
At various times (usually separated by a few days), the loading process
against the server will suddenly either take an extremely long time
(going from the usual 15-20 minutes up to 45 minutes or more) or simply
hang altogether (i.e., after 8+ hours have gone by, we haven't had the
patience to wait and see if it eventually finishes). This is on an 8.3.3
server running on Debian Etch; most of the previous testing occurred on
8.1.10, where we did not encounter this problem (though we had other
issues on that platform).

I have run some tests of large object loading against a PostgreSQL 8.3.3
database using a more rapid turnaround than in production (loading,
deleting, vacuuming directly, and then starting the load over again),
and have been able to provoke what seems to be the same issue there,
usually within 12 hours.

Essentially, we have a C++ program (using pqxx) loading a large number
of large objects (around 80,000 or so) that exhibits strange behaviour.
Loading time varies quite a lot (from 1600 to 2500 seconds), and at
random points the program simply hangs (at least it seems so, given that
it has not progressed at all after 7-8 hours). It displays this
behaviour irrespective of whether we load all 80,000 fields in one go or
break the loading up into consecutive batches of 1000 blobs at a time
(though the problem does seem easier to provoke with the former
approach). It can run fine for hours at a time and then suddenly lock up
(in the latest test I did, it loaded the data correctly 22 times in a
row before locking up on the 23rd run-through).
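The loading code amounts to something like the following (a minimal
sketch rather than our actual code; it assumes the classic libpqxx
large-object API, pqxx::largeobjectaccess):

    #include <pqxx/pqxx>
    #include <pqxx/largeobject>
    #include <vector>

    // Create one large object inside its own transaction and fill it
    // with the given data; returns the OID of the new large object.
    pqxx::oid load_blob(pqxx::connection &conn, const std::vector<char> &data)
    {
        pqxx::work txn(conn, "load_blob");

        // Creates a new large object and opens it for writing.
        pqxx::largeobjectaccess lo(txn, std::ios::out);
        lo.write(&data[0], data.size());

        pqxx::oid id = lo.id();
        txn.commit();
        return id;
    }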

The pattern seems to be identical each time: ps aux shows the process
(wdb) idle in transaction, with no activity apparent on the system:

postgres 19965  0.0  0.0  72628  5584 ?  S   Feb17  0:01 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/postgresql/8.3/main -c config_file=/etc/postgresql/8.3/main/postgresql.conf
postgres 19967  0.4  0.0  72748 27636 ?  Ss  Feb17  6:38 postgres: writer process
postgres 19968  0.0  0.0  72628  1784 ?  Ss  Feb17  0:09 postgres: wal writer process
postgres 19969  0.0  0.0  72940  1924 ?  Ss  Feb17  0:00 postgres: autovacuum launcher process
postgres 19970  0.0  0.0  42352  1736 ?  Ss  Feb17  0:02 postgres: stats collector process
postgres 16617  0.0  0.0  95156 10948 ?  Ss  08:42  0:00 postgres: autovacuum worker process   wdb
postgres 18834  0.0  0.1 118632 37320 ?  Ss  09:06  0:01 postgres: wdb wdb [local] idle in transaction

A core dump we took of the program earlier seemed to indicate that it
was waiting for the database to return from the insert statement. top
indicates that both postgres and the application are idling (no load on
the server whatsoever).

pg_locks and pg_stat_activity seem to indicate that there may be some
issue with autovacuum:

  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------
 relation   |    16385 |     2683 |      |       |            |               |         |       |          | 2/1122117          | 16617 | RowExclusiveLock         | t
 relation   |    16385 |    10969 |      |       |            |               |         |       |          | 3/1105             | 27895 | AccessShareLock          | t
 relation   |    16385 |     2613 |      |       |            |               |         |       |          | 2/1122117          | 16617 | ShareUpdateExclusiveLock | t
 virtualxid |          |          |      |       | 3/1105     |               |         |       |          | 3/1105             | 27895 | ExclusiveLock            | t
 virtualxid |          |          |      |       | 2/1122117  |               |         |       |          | 2/1122117          | 16617 | ExclusiveLock            | t
(5 rows)

 datid | datname | procpid | usesysid | usename  |                     current_query                     | waiting |          xact_start           |          query_start          |         backend_start         | client_addr | client_port
-------+---------+---------+----------+----------+-------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------
 16385 | wdb     |   16617 |       10 | postgres | autovacuum: VACUUM ANALYZE pg_catalog.pg_largeobject   | f       | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.065857+00 |             |
 16385 | wdb     |   27895 |    16384 | wdb      | select * from pg_stat_activity ;                        | f       | 2009-02-18 12:21:47.925813+00 | 2009-02-18 12:21:47.925813+00 | 2009-02-18 11:09:31.201507+00 |             |          -1
 16385 | wdb     |   18834 |    16384 | wdb      | <IDLE> in transaction                                   | f       |                               | 2009-02-18 09:06:26.479633+00 | 2009-02-18 09:06:23.443698+00 |             |          -1
(3 rows)
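
For what it's worth, a join along the following lines makes it easier to
see which backend holds which lock (just a convenience sketch; it
assumes the 8.3 column names procpid/current_query, and the regclass
cast merely makes the relation OIDs readable):

    SELECT l.locktype, l.relation::regclass, l.mode, l.granted,
           a.procpid, a.current_query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.procpid = l.pid
    ORDER BY a.procpid;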

Any leads as to how to figure out the problem or alternatively how to
work around it would be much appreciated.

I noticed that the test machine seemed to be running low on disk space,
caused by an error in the FSM settings, but this was not the case on the
production systems, so I assume that this is not the problem here either
(I will fix it for future test runs, though).
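
(For the record, on 8.3 the FSM sizing can be checked with something
like the following; a database-wide VACUUM VERBOSE reports at the end of
its output how many page slots are needed versus what max_fsm_pages
allows:)

    SHOW max_fsm_pages;
    SHOW max_fsm_relations;
    VACUUM VERBOSE;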

Regards,

Michael Akinde

Attachment Content-Type Size
michael.akinde.vcf text/x-vcard 287 bytes
