From: | Merlin Moncure <mmoncure(at)gmail(dot)com> |
---|---|
To: | Felde Norbert <fenor77(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org |
Subject: | Re: postgres crash SOS |
Date: | 2010-06-18 13:03:18 |
Message-ID: | AANLkTilasBCtsrxowDyLGQljFWWRhzkDbwL0oy1ixWI_@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Fri, Jun 18, 2010 at 4:55 AM, Felde Norbert <fenor77(at)gmail(dot)com> wrote:
> Hi,
>
> This are the informations I could collect:
>
>
> We use cobian to create the backup.
> There are two volumes in use, on C is the volume where everything is
> installed and here is the postgres data dir too.
> The postgres backup that runs everynight places the backup file on
> this volume too, it runs before daily backup is started.
> There is an another volume where the cobian places the daily backups.
>
> So to be precise:
> C:
> postgres
> postgres\data
> postgres dump before daily backup is started
> D:
> daily backups including postgres dump from C
>
> The D volume was full on the 06-06 and stayd so for 5 days.
>
> The first virtual memory log entry happend on the 06-09 05:41 and the
> last came 06-10 16:18
> The log entries are about the same:
> Windows successfully diagnosed a low virtual memory condition.
> The following programs consumed the most virtual memory:
> cbService.exe (2348) consumed 2058158080 bytes,
> explorer.exe (7136) consumed 245456896 bytes,
> and McScript_InUse.exe (1908) consumed 218529792 bytes
>
>
> In the postgres log at that time is this:
> Postgres log
> 2010-06-10 16:58:14 LOG: database system was interrupted at 2010-06-10 16:16:36
> 2010-06-10 16:58:14 LOG: checkpoint record is at 0/9FBE5158
> 2010-06-10 16:58:14 LOG: redo record is at 0/9FBE5158; undo record is
> at 0/0; shutdown FALSE
> 2010-06-10 16:58:14 LOG: next transaction ID: 0/3620193; next OID: 6744703
> 2010-06-10 16:58:14 LOG: next MultiXactId: 2; next MultiXactOffset: 3
> 2010-06-10 16:58:14 LOG: database system was not properly shut down;
> automatic recovery in progress
> 2010-06-10 16:58:14 LOG: redo starts at 0/9FBE51A8
> 2010-06-10 16:58:14 FATAL: the database system is starting up
> 2010-06-10 16:58:14 LOG: record with zero length at 0/9FEEDF60
> 2010-06-10 16:58:14 LOG: redo done at 0/9FEEDF30
> 2010-06-10 16:58:15 FATAL: the database system is starting up
> 2010-06-10 16:58:16 FATAL: the database system is starting up
> 2010-06-10 16:58:17 FATAL: the database system is starting up
> 2010-06-10 16:58:17 LOG: database system is ready
> Before this I can not find any interesting entries in the postgres log.
>
>
> The first postgres backup that failed was on 06-11 00:30. The log is
> filled with that message:
> 2010-06-11 00:31:19 ERROR: xlog flush request 0/9FF74848 is not
> satisfied --- flushed only to 0/9FEEDFB0
> 2010-06-11 00:31:19 CONTEXT: writing block 17942 of relation
> 1663/4192208/4192534
> 2010-06-11 00:31:19 STATEMENT: FETCH 100 FROM _pg_dump_cursor.
> This message appears in 1 sec intervals and only the writing blocks
> blocknumber changes.
>
>
>
> About the informations you asked:
> There are 2 SCSI drives and they are mirrored using windows mirroring.
> As I could find out, the mirroring is done with default settings.
> The fsync settings are the default.
My guess here is that you are running into a bad interaction between
postgres, your backup software, and windows. Cobian like most windows
backup software products uses vss which snapshots the volume during
the backup. While that is happening writes are windows has to keep
track of files written to in a different place. Something blew a fuse
in windows and the database was reset (I'm betting your server
completely reset in addition to postgres).
This is pretty nasty because it's likely to happen again and your
storage looks like it's no properly honoring fsync. My advice would
be to double check the fsync situation first so you don't lose any
more data (check windows write caching, and your scsi controller cache
setting). After that, I think you need to take a look at your backup
strategy. Snapshotting the volume postgres is on is probably not a
good idea, especially if you are relying on pg_dump for the backup.
merlin
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2010-06-18 14:06:06 | Re: Unable to determine what has a particular OID |
Previous Message | Ashesh Vashi | 2010-06-18 12:40:41 | Re: How to install 8.4 in Fedora 11 Leonidas |