pg_dump generating unrestorable data (8.4)

From: Glenn Maynard <glenn(at)zewt(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: pg_dump generating unrestorable data (8.4)
Date: 2011-04-03 18:49:00
Message-ID: BANLkTin5ukCPGpTbtCu84EUNBiTmrB0YSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

After dumping a database (pg_dump -F c database > dump), trying to restore
it (pg_restore dump) gives:

> pg_restore: [archiver (db)] Error from TOC entry 2463; 0 58451 TABLE DATA
table user
> pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for
encoding "UTF8": 0xe3273a
> HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
> CONTEXT: COPY table, line 1

The surface reason for this is clear enough: invalid UTF-8 data crept into
some tsvector columns. This is a much more serious problem, however: the
backup tools for the database are, without warning, generating data that
can't be restored.

When in a data recovery situation, a backup that won't restore is
catastrophic. I can't restore the database to a state it was in at the time
of the backup; I have to spend hours of downtime figuring out what to do to
make something usable out of my backup; and then I have to hope I've
corrected the backup correctly before bringing the server back online. (If
I was in an actual backup recovery situation--fortunately I'm not--I'd be
more inclined to edit the Postgresql source to disable this check while
restoring the backup than to risk trying to manually fix the backup data
directly, which is very easy to get wrong.)

Recommendations:

- So long as there's any possibility of this happening, it should be
possible to force Postgresql to ignore this error, to guarantee that backups
can be restored. If the database allowed this situation to happen in the
first place, then it should allow it to be restored from a backup too.
- There was no warning of any problem when the backup was made. pg_dump
should warn about potential recovery problems in the data it's outputting.
This should be a fatal error unless explicitly overridden, so backups that
can't be restored won't be generated accidentally.

I don't know how this data entered the database in the first place. The
problematic data is (so far) entirely in tsvectors, generated from
to_tsvector and/or tsvector_update_trigger, but I havn't reproduced it.
Updating the rows in question on the server (to cause its
tsvector_update_trigger to be fired) fixes the problematic tsvector
columns. For what it's worth, 0xe3273a in the dump is within the string
"'x':17" where x is \xe3, the first byte of the UTF-8 representation of
U+30FC "ー". (If this sounds like a known or fixed problem I'd be interested
to know, but this sort of problem in a minor subsystem like FTS shouldn't be
able to silently break backups in the first place.)

--
Glenn Maynard

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Joshua D. Drake 2011-04-03 20:39:46 Re: Autovacuum firing up during my manual vacuum on same table
Previous Message Thomas Kellerer 2011-04-03 14:25:35 Re: Table lock while adding a column and clients are logged in