autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

From: "Jim Hurne" <jhurne(at)us(dot)ibm(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked
Date: 2020-06-16 15:59:37
Message-ID: OF3946AD2A.26D4644F-ON85258589.005766BA-85258589.0057DB01@notes.na.collabserv.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

We have a cloud service that uses PostgreSQL to temporarily store binary
content. We're using PostgreSQL's Large Objects to store the binary
content. Each large object lives anywhere from a few hundred milliseconds
to 5-10 minutes, after which it is deleted.

Normally, this works just fine and we have no issues. However, roughly
every 3 months or so, disk usage on our PostgreSQL database starts to
increase at an alarming rate. More specifically, the pg_largeobject
table's disk usage steadily increases even though we've deleted most of
the large objects. This will continue until the database server runs out
of disk space (and then all sorts of bad things happen of course).

It looks to us like autovacuum against the pg_largeobject table stops
working because the pg_stat_all_tables.last_autovacuum column stops
updating, or is updated infrequently, while it is still updated on other
tables:

schemaname | relname | n_live_tup | n_dead_tup |
last_autovacuum
------------+-------------------------+------------+------------+-------------------------------
pg_catalog | pg_largeobject | 37205 | 92614852 |
2020-06-15 01:55:09.037101+00
pg_toast | pg_toast_973434724 | 281 | 3701 |
2020-06-15 17:22:54.746452+00
pg_catalog | pg_largeobject_metadata | 320 | 1236 |
2020-06-15 17:23:49.765878+00
pg_catalog | pg_shdepend | 344 | 933 |
2020-06-15 17:23:54.448855+00
pg_toast | pg_toast_2619 | 52 | 85 |
2020-06-15 17:22:01.387933+00
pg_catalog | pg_statistic | 413 | 100 |
2020-06-15 17:15:52.656791+00
pg_catalog | pg_class | 349 | 2 |
2020-06-12 17:58:13.147931+00
pg_catalog | pg_attribute | 2633 | 1 |
2020-06-12 17:58:13.000114+00

Looking at our PostgreSQL logs, it looks like the autovacuum task takes
increasingly more time to run (against the pg_largeobject table):

2020-06-12T19:41:58.335931494Z stderr F system usage: CPU: user:
0.02 s, system: 0.00 s, elapsed: 1.77 s
2020-06-12T19:42:59.704884752Z stderr F system usage: CPU: user:
0.02 s, system: 0.02 s, elapsed: 3.06 s
2020-06-12T19:44:01.928300512Z stderr F system usage: CPU: user:
0.06 s, system: 0.01 s, elapsed: 5.44 s
2020-06-12T19:45:14.124209167Z stderr F system usage: CPU: user:
0.11 s, system: 0.05 s, elapsed: 17.13 s
2020-06-12T19:46:16.28758936Z stderr F system usage: CPU: user: 0.13 s,
system: 0.08 s, elapsed: 19.04 s
2020-06-12T19:47:34.264882626Z stderr F system usage: CPU: user:
0.20 s, system: 0.19 s, elapsed: 36.22 s
2020-06-12T19:49:15.383436343Z stderr F system usage: CPU: user:
0.28 s, system: 0.38 s, elapsed: 74.06 s
2020-06-12T19:53:47.229361981Z stderr F system usage: CPU: user:
0.66 s, system: 1.06 s, elapsed: 214.12 s
2020-06-12T20:19:39.619748109Z stderr F system usage: CPU: user:
4.13 s, system: 5.30 s, elapsed: 1461.16 s
2020-06-12T21:30:31.634637945Z stderr F system usage: CPU: user:
13.73 s, system: 19.59 s, elapsed: 4225.61 s
2020-06-12T23:54:32.511015886Z stderr F system usage: CPU: user:
33.57 s, system: 41.90 s, elapsed: 8514.23 s
2020-06-13T04:23:32.230960572Z stderr F system usage: CPU: user:
66.09 s, system: 82.95 s, elapsed: 16011.25 s
2020-06-13T12:00:37.43434175Z stderr F system usage: CPU: user: 99.42 s,
system: 130.48 s, elapsed: 27296.71 s
2020-06-14T02:40:25.19122979Z stderr F system usage: CPU: user: 202.96 s,
system: 263.66 s, elapsed: 52653.66 s
2020-06-15T01:55:09.03766272Z stderr F system usage: CPU: user: 317.54 s,
system: 544.48 s, elapsed: 83550.21 s

We have several instances of our cloud service, and each instance has it's
own database. On other instances, the autovacuum elapsed is consistently
less than 25 seconds on every run.

Other than the increasing elapsed times for the autovacuum, we don't see
any other indication in the logs of a problem (no error messages, etc).

We're currently using PostgreSQL version 10.10. Our service is JVM-based
and we're using the PostgreSQL JDBC driver version 42.2.5.

Have we stumbled upon a potential bug here, or do we need to tweak some
autovacuum settings? What should we look at next or what should we try
next to further troubleshoot this?

Regards,

Jim Hurne

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Aleš Zelený 2020-06-16 16:29:55 Re: Logical replication - ERROR: could not send data to WAL stream: cannot allocate memory for input buffer
Previous Message Pepe TD Vo 2020-06-16 15:42:18 Re: create batch script to import into postgres tables