Re: Lost rows/data corruption?

From: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
To: "Marco Colombo" <pgsql(at)esiway(dot)net>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lost rows/data corruption?
Date: 2005-02-16 00:27:54
Message-ID: 021e01c513be$5b20d700$5001010a@bluereef.local
Lists: pgsql-general

fsync is on for all these boxes. Our customers run their own hardware, with
many different hardware specifications in use. Many of our customers don't
have a UPS, although their power is probably pretty reliable (normal
city-based utilities), but of course I can't guarantee they don't get an
outage once in a while with a thunderstorm etc.

The problem here is that we are consistently seeing the same kind of
corruption and symptoms across a fairly large number of customers (52 have
reported this problem), so there is something endemic happening here that,
to be honest, I'm surprised no one else is seeing. Fundamentally there is
nothing particularly abnormal about our application or data, but regardless,
I would have thought these kinds of things (application design, data
representation etc.) were irrelevant to whether the database reliably
refuses duplicate data on a primary key. Something is causing this
corruption, and one thing we do know is that it doesn't happen immediately
with a new installation; it takes time (several months of usage) before we
start to see this condition. I'd be really surprised if XFS is the problem,
as I know there are plenty of other people across the world using it
reliably with PG.

We're going to see if we can build a test environment that can forcibly
cause this, but I don't hold much hope, as we've tried to isolate it before
with little success. Here's what we tried changing when we originally went
searching for the problem, and it is still here:

- the hardware (tried single CPU instead of dual - though that may be an
issue with the OS)
- the OS version (tried Linux 2.6.5, 2.6.6, 2.6.7, 2.6.8.1, 2.6.10 and
2.4.22) - all using XFS
- the database table layout (tried changing the way the data is stored)
- the version of Jetty (servlet engine)
- the DB pool manager and PG JDBC driver versions
- the version of PG (tried two or three back from the latest)
- various vacuum regimes

----- Original Message -----
From: "Marco Colombo" <pgsql(at)esiway(dot)net>
To: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
Cc: <pgsql-general(at)postgresql(dot)org>
Sent: Wednesday, February 16, 2005 2:58 AM
Subject: Re: Lost rows/data corruption?

> On Tue, 15 Feb 2005, Andrew Hall wrote:
>
>>
>>
>>>> It sounds like a mess, all right. Do you have a procedure to follow to
>>>> replicate this havoc? Are you sure there's not a hardware problem
>>>> underlying it all?
>>>>
>>>> regards, tom lane
>>>>
>>
>> We haven't been able to isolate what causes it, but it's unlikely to be
>> hardware as it happens on quite a few of our customers' boxes. We also
>> use XFS on Linux 2.6 as a file system, so the FS should be fairly
>> tolerant to power outages. Any ideas as to how I might go about isolating
>> this? Have you heard any other reports of this kind and suggested
>> remedies?
>
> Are you running with fsync = off? and did the hosts experience any
> power-outage recently?
>
> .TM.
> --
> ____/ ____/ /
> / / / Marco Colombo
> ___/ ___ / / Technical Manager
> / / / ESI s.r.l.
> _____/ _____/ _/ Colombo(at)ESI(dot)it
>
