Re: postgres crash SOS

From: Felde Norbert <fenor77(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: postgres crash SOS
Date: 2010-06-18 16:14:55
Message-ID: AANLkTik-OUox_1X6kTJgruzy6pTOhhysH6izTd2pGev8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

This is my opinion, tell me if I am wrong.
the backup runs at 00:30. At that time, i am sure nobody use the
program which use postgres, so there is no transaction which store
would be a problem when cobian creates the snapshot.
Cobian started one day to create the backup but the volume was full
that caused that cobian hanged up.
Cobian was still runing and slowly filled the virtual memory and when
it was full, something went wrong with a postgres transaction. After
that point no backup could made, but what I am not understand is how
could postgres still serve data to our application.

We use this backup method for 3 years and nothing happend. Ok, I know,
we was lucky or something else, but I think it is long time if it is
so wrong.

Thank you for your suggestions and advices, I will see what can I do.

fenor

2010/6/18 Merlin Moncure <mmoncure(at)gmail(dot)com>:
> On Fri, Jun 18, 2010 at 4:55 AM, Felde Norbert <fenor77(at)gmail(dot)com> wrote:
>> Hi,
>>
>> This are the informations I could collect:
>>
>>
>> We use cobian to create the backup.
>> There are two volumes in use, on C is the volume where everything is
>> installed and here is the postgres data dir too.
>> The postgres backup that runs everynight places the backup file on
>> this volume too, it runs before daily backup is started.
>> There is an another volume where the cobian places the daily backups.
>>
>> So to be precise:
>> C:
>>        postgres
>>        postgres\data
>>        postgres dump before daily backup is started
>> D:
>>        daily backups including postgres dump from C
>>
>> The D volume was full on the 06-06 and stayd so for 5 days.
>>
>> The first virtual memory log entry happend on the 06-09 05:41 and the
>> last came 06-10 16:18
>> The log entries are about the same:
>> Windows successfully diagnosed a low virtual memory condition.
>> The following programs consumed the most virtual memory:
>> cbService.exe (2348) consumed 2058158080 bytes,
>> explorer.exe (7136) consumed 245456896 bytes,
>> and McScript_InUse.exe (1908) consumed 218529792 bytes
>>
>>
>> In the postgres log at that time is this:
>> Postgres log
>> 2010-06-10 16:58:14 LOG:  database system was interrupted at 2010-06-10 16:16:36
>> 2010-06-10 16:58:14 LOG:  checkpoint record is at 0/9FBE5158
>> 2010-06-10 16:58:14 LOG:  redo record is at 0/9FBE5158; undo record is
>> at 0/0; shutdown FALSE
>> 2010-06-10 16:58:14 LOG:  next transaction ID: 0/3620193; next OID: 6744703
>> 2010-06-10 16:58:14 LOG:  next MultiXactId: 2; next MultiXactOffset: 3
>> 2010-06-10 16:58:14 LOG:  database system was not properly shut down;
>> automatic recovery in progress
>> 2010-06-10 16:58:14 LOG:  redo starts at 0/9FBE51A8
>> 2010-06-10 16:58:14 FATAL:  the database system is starting up
>> 2010-06-10 16:58:14 LOG:  record with zero length at 0/9FEEDF60
>> 2010-06-10 16:58:14 LOG:  redo done at 0/9FEEDF30
>> 2010-06-10 16:58:15 FATAL:  the database system is starting up
>> 2010-06-10 16:58:16 FATAL:  the database system is starting up
>> 2010-06-10 16:58:17 FATAL:  the database system is starting up
>> 2010-06-10 16:58:17 LOG:  database system is ready
>> Before this I can not find any interesting entries in the postgres log.
>>
>>
>> The first postgres backup that failed was on 06-11 00:30. The log is
>> filled with that message:
>> 2010-06-11 00:31:19 ERROR:  xlog flush request 0/9FF74848 is not
>> satisfied --- flushed only to 0/9FEEDFB0
>> 2010-06-11 00:31:19 CONTEXT:  writing block 17942 of relation
>> 1663/4192208/4192534
>> 2010-06-11 00:31:19 STATEMENT:  FETCH 100 FROM _pg_dump_cursor.
>> This message appears in 1 sec intervals and only the writing blocks
>> blocknumber changes.
>>
>>
>>
>> About the informations you asked:
>> There are 2 SCSI drives and they are mirrored using windows mirroring.
>> As I could find out, the mirroring is done with default settings.
>> The fsync settings are the default.
>
> My guess here is that you are running into a bad interaction between
> postgres, your backup software, and windows.  Cobian like most windows
> backup software products uses vss which snapshots the volume during
> the backup.  While that is happening writes are windows has to keep
> track of files written to in a different place.  Something blew a fuse
> in windows and the database was reset (I'm betting your server
> completely reset in addition to postgres).
>
> This is pretty nasty because it's likely to happen again and your
> storage looks like it's no properly honoring fsync.  My advice would
> be to double check the fsync situation first so you don't lose any
> more data (check windows write caching, and your scsi controller cache
> setting).  After that, I think you need to take a look at your backup
> strategy. Snapshotting the volume postgres is on is probably not a
> good idea, especially if you are relying on pg_dump for the backup.
>
> merlin
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Evan Carroll 2010-06-18 16:20:09 New COPY command (alternative to pg_loader) and Perl shell
Previous Message Andy Dale 2010-06-18 15:31:37 Re: Unable to determine what has a particular OID