Re: Possible data corruption with Postgres 7.4.8

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Eric B(dot) Ridge" <ebr(at)tcdi(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Joey Adams <jea(at)tcdi(dot)com>
Subject: Re: Possible data corruption with Postgres 7.4.8
Date: 2006-03-14 04:12:55
Message-ID: 15378.1142309575@sss.pgh.pa.us
Lists: pgsql-general

"Eric B. Ridge" <ebr(at)tcdi(dot)com> writes:
> Does anyone here have any kind of explanation other than bad hardware?

Well, there are several data-corruption bugs fixed between 7.4.8 and
7.4.12, though whether any of them explains your symptoms is difficult
to say:

2005-11-02 19:23 tgl

* src/backend/access/transam/slru.c (REL7_4_STABLE): Fix
longstanding race condition in transaction log management: there
was a very narrow window in which SimpleLruReadPage or
SimpleLruWritePage could think that I/O was needed when it wasn't
(and indeed the buffer had already been assigned to another page).
This would result in an Assert failure if Asserts were enabled, and
probably in silent data corruption if not. Reported independently
by Jim Nasby and Robert Creager.

I intend a more extensive fix when 8.2 development starts, but this
is a reasonably low-impact patch for the existing branches.

2005-08-25 18:07 tgl

* src/: backend/access/heap/heapam.c, backend/commands/async.c,
backend/commands/trigger.c, backend/commands/vacuum.c,
backend/executor/execMain.c, backend/utils/time/tqual.c,
include/access/heapam.h, include/executor/executor.h
(REL7_4_STABLE): Back-patch fixes for problems with VACUUM
destroying t_ctid chains too soon, and with insufficient paranoia
in code that follows t_ctid links. This patch covers the 7.4
branch.

2005-05-07 17:33 tgl

* src/backend/: access/heap/hio.c, access/nbtree/nbtpage.c,
access/nbtree/nbtree.c, commands/vacuumlazy.c (REL7_4_STABLE):
Repair very-low-probability race condition between relation
extension and VACUUM: in the interval between adding a new page to
the relation and formatting it, it was possible for VACUUM to come
along and decide it should format the page too. Though not harmful
in itself, this would cause data loss if a third transaction were
able to insert tuples into the vacuumed page before the original
extender got control back.

2005-05-07 17:23 tgl

* src/backend/utils/time/tqual.c (REL7_4_STABLE): Adjust time qual
checking code so that we always check TransactionIdIsInProgress
before we check commit/abort status. Formerly this was done in
some paths but not all, with the result that a transaction might be
considered committed for some purposes before it became committed
for others. Per example found by Jan Wieck.
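
To illustrate why the order of the two checks in the last entry matters, here is a minimal toy model, not the actual tqual.c logic. It assumes that commit status is recorded in the commit log shortly before the transaction is removed from the in-progress list, so for a brief window both sources mention the transaction. The names `in_progress`, `commit_log`, and both functions are hypothetical.

```python
# Toy model of the window during commit. Assumption: status reaches the
# commit log before the transaction leaves the in-progress set.
in_progress = {100}               # hypothetical stand-in for the in-progress list
commit_log = {100: "committed"}   # hypothetical stand-in for the commit log


def visible_checking_log_first(xid):
    """Ordering used on some paths before the fix: trust the commit log at once."""
    return commit_log.get(xid) == "committed"


def visible_checking_in_progress_first(xid):
    """Ordering after the fix: a transaction still listed as running
    is never treated as committed."""
    if xid in in_progress:
        return False
    return commit_log.get(xid) == "committed"


# Inside the window the two orderings disagree about transaction 100.
print(visible_checking_log_first(100))          # True
print(visible_checking_in_progress_first(100))  # False
```

Once the transaction is removed from `in_progress`, both functions agree again; the disagreement exists only inside the window, which matches the "committed for some purposes before it became committed for others" description.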

The relation-extension race condition could explain recently-added
tuples simply disappearing, though if it happened in more than one table
you'd have to assume that the race condition window got hit more than
once. The slru race condition is even narrower, but if it hit then it
could cause tuples inserted by the same transaction into different
tables to become lost. Does either of these seem to match your symptoms?

regards, tom lane
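
The relation-extension race described above can be sketched as a deterministic replay of the interleaving. This is only an illustrative model under simplifying assumptions: the `Page` class and `simulate` function are hypothetical, "formatting" is modelled as resetting the page contents, and no real locking or PostgreSQL code is involved.

```python
class Page:
    """Hypothetical stand-in for a heap page."""

    def __init__(self):
        self.formatted = False
        self.tuples = []

    def format(self):
        # Formatting initializes the page, discarding whatever it held.
        self.formatted = True
        self.tuples = []


def simulate(vacuum_interleaves):
    """Replay the steps in a fixed order and return the tuples left on the page."""
    page = Page()  # the extender adds a new page but has not formatted it yet
    if vacuum_interleaves:
        page.format()                    # VACUUM decides to format the page too
        page.tuples.append("new row")    # a third transaction inserts into it
        page.format()                    # the original extender regains control
    else:
        page.format()                    # the extender formats its page first
        page.tuples.append("new row")    # the insert happens afterwards
    return page.tuples


print(simulate(vacuum_interleaves=True))   # [] : the inserted row is gone
print(simulate(vacuum_interleaves=False))  # ['new row']
```

In the interleaved run the row vanishes without any error being raised, which is why this kind of race would show up as recently added tuples simply disappearing.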
