Quick Links

"missing chunk number XX for toast value YY in pg_toast ..." after pg_basebackup.

From:	<fredrik(at)huitfeldt(dot)com>
To:	<pgsql-general(at)postgresql(dot)org>
Subject:	"missing chunk number XX for toast value YY in pg_toast ..." after pg_basebackup.
Date:	2016-03-02 11:54:38
Message-ID:	1456919678340.31300.116900@webmail2
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Hi All,

we are running postgresql 9.1.15 on Debian.

we are, basically, running a postgresql cluster with two nodes. We are using synchronous streaming replication to make sure that the slave is always fully in sync (using a recovery.conf that points out the master). The slave is mainly used to maintain high availability. We also run backups on the slave to reduce load on the master.

On both nodes, postgresqls data is stored on a SAN.

At some point, we restarted the slave. The first thing it does is to do a pg_basebackup and then start streaming changes based on that. This seemed to work fine, until 16 hrs later when we made a backup on the slave, using pg_dump. pg_dump was unable to complete, due to a pg_toast error (ERROR: missing chunk number 0 for toast value 2753291 in pg_toast_22066).

The issue remained until we ran a full vacuum analyze on the cluster.

We have been running in this way on several clusters for some years now, basically since 9.1 was released, without seeing this issue. Since we upgraded to postgresql 9.1.15, we have seen it twice. This does not necessarily mean anything, but I thought I would mention it.

The issue is resolved now, but I would still like to understand what happened.
I have logfiles from the incident, but I cannot see anything out of the ordinary (despite having a fair amount of experience investigating postgresql logs).

I have read that this kind of issues are most frequently due to hardware issues or bugs in postgresql.

Using <https://bucardo.org/postgres_all_versions.html#version_9.1.15>,
I have reviewed the release notes from the subsequent 9.1 releases (9.1.15-9.1.19),
but I have been unable to identify anything that hints at the issue that we have experienced.

the closest one would be:
(9.1.16,9.4.2,9.3.7,9.2.11,9.0.20) Fix possible failure during hash index bucket split, if other processes are modifying the index concurrently (Tom Lane)

but that seems like a long shot.

My question is:
- does anyone know of bug reports (fixed or otherwise) that could shed light on our issue?
- does anyone have experience with this kind of behavior?

- our setup is fairly automated; would it be a good idea to always run vacuum analyze after a pg_restore (or before pg_dump)?

Browse pgsql-general by date

	From	Date	Subject
Next Message	Alexander Farber	2016-03-02 12:39:45	Re: How to ensure that a stored function always returns TRUE or FALSE?
Previous Message	Albe Laurenz	2016-03-02 11:35:54	Re: Does RAISE EXCEPTION rollback previous commands in a stored function?