Re: Possible causes for database corruption and solutions

From: Michael Clark <codingninja(at)gmail(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Possible causes for database corruption and solutions
Date: 2009-12-16 13:19:37
Message-ID: bf5d83510912160519h5938e119kb189a70bb2967df3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello Craig - thanks for the reply. I will reply below.

On Tue, Dec 15, 2009 at 9:34 PM, Craig Ringer
<craig(at)postnewspapers(dot)com(dot)au>wrote:

> On 16/12/2009 6:39 AM, Michael Clark wrote:
>
>> Hello all,
>>
>> Over the past 6 months or so I have posted to the list a couple times
>> looking for information regarding recovering databases from corruption.
>> At the time the incidents of corruption among our users was very low,
>> but the frequency is starting to increase, most likely due to the
>> increased user base that has upgraded to the version of our software
>> that uses Postgres.
>>
>> I have a couple questions I am hoping to get some feedback on.
>>
>
> Secondly, I ask about an alternative solution to the corruption problem
>> because with preliminary testing we have seen a significant degradation
>> in performance.
>>
>
> From changing to fsync_writethrough ? That's a good thing - it suggests
> that maybe now the data is actually hitting disk when Pg asks it to.
>
> You can have fast(er), or safe, but not both. Now that your database is
> actually doing what it should be and truthfully promising that data has hit
> disk when you commit, you may have to adopt some strategies to reduce the
> number of very short repetitive transactions you perform.
>
>
That is true, this does bode well for us finding our cause. And that old
axiom does often ring true.

( Should Pg perhaps detect OS X and switch the default to fsync_writethrough
> ? Or are the "test to see if fsync() works on startup" plans going anywhere?
> )
>
>
Not sure if you were directing that to me specifically, I feel probably to
the general PG public.
But, this is probably a lot less of a problem in a proper server
environment, which most of our users are not in. (I touch more on that to in
my reply to you and Scott).
For example, we run our software on a Mac Pro, with raid 1 and the machine
is protected by UPS. We have never had an issue ourselves.
I guess it depends on the majority use case for PG on OS X whether this
should be defaulted to on.
(Although siding with caution usually never hurts when data is concerned.)

>
> I then restore the dump into a newly created DB (createdb -E UTF8 Test1)
>> with the command:
>> pg_restore -F c -d Test1 pgbackup
>>
>
> Make sure to do the restore in a single transaction. It will be
> *enormously*, *massively* faster. See the "--single-transaction" option to
> pg_restore.
>
>
Thanks for that tip, cut the restore in half in this particular test.

> If you can't do that, then set a commit_delay so that PostgreSQL can batch
> the fsync()s for commmits together. See postgresql.conf and the
> documentation for commit_delay and commit_siblings.
>
>
>
I will look those up as well, thanks.

> I am hoping there may be an alternative to fsync_writethrough and/or we
>> are barking up the wrong tree w.r.t. the cause of the corruptions.
>>
>
> If fsync_writethrough slowed things down that much then you've almost
> certainly nailed the cause of the corruptions. Now you just need to tune
> your DB, and adapt how you use the DB, so that you're less affected by the
> necessary performance hit imposed by safe and reliable use of disk storage.
>
>
It definitely feels better knowing that we likely found our issue, and you
are right, time to tune as best we can.
Luckily these sort of large operations are generally the smaller percentage
of use cases.

Thanks again for the reply,
Michael

> --
> Craig Ringer
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Mai Fawzy 2009-12-16 13:34:50 Re: Cause of error message?
Previous Message yuliada 2009-12-16 12:56:16 Slow select