Re: postgres 9.0.2 replicated database is crashing

From: akp geek <akpgeek(at)gmail(dot)com>
To: Lonni J Friedman <netllama(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: postgres 9.0.2 replicated database is crashing
Date: 2013-03-10 00:05:40
Message-ID: CACnhOfLsS7Rub=5ojXifPawdwzbA3UWESCXo9c9p1yV2WKarGw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Appreciate your findings. taking your points and doing things now. We can
upgrade. Which version is more stable 9.2 or 9.1

We use GIST indexes quite a bit. and we gis also

I recently compiled postgres 9.2 ..

Regards

On Sat, Mar 9, 2013 at 5:09 PM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:

> On Sat, Mar 9, 2013 at 1:51 PM, akp geek <akpgeek(at)gmail(dot)com> wrote:
> > thank you. As you mentioned, I understood that I am starting the
> streaming
> > scratch which is not what I wanted to do.
> >
> > Here is what I am planning to .
> >
> > Our replication process was down since March5th.
> >
> > 1. Is it Ok to get all wals from March5th till now to standby pg_xlog
> > 2. take pg_basebackup
> > 3. export the data dir
> > 4. stop backup
> > 5. restart the standby.
> >
> > Based on my understanding it should work, because when standby started,
> it
> > is not able find the files, as I have deleted from the slave them when I
> > tried to fix the issue
>
> Clearly it didn't work because you've been having issues for the past
> 4 days, and you've already tried this approach unsuccessfully. Based
> on the log snippets below, it looks like you have multiple problems.
> First your slave is definitely missing WAL segments. You should
> increase the number of WAL segments that are archived on the master
> and ensure that your base backup is including the WAL segments (by
> generating it with the -x option). However, that's the least of your
> problems at the moment, because it looks like the master is what is
> crashing.
>
> >
> >
> > 00000 2013-03-09 04:56:08 GMT LOG: entering standby mode
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E1
> > 00000 2013-03-09 04:56:08 GMT LOG: redo starts at 106/E1000020
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E2
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E3
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E4
> > 00000 2013-03-09 04:56:10 GMT LOG: consistent recovery state reached
> at
> > 106/E45AD4A8
> > 00000 2013-03-09 04:56:10 GMT LOG: database system is ready to accept
> > read only connections
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E5
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E6
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E7
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E8
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E9
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
> > 00000 2013-03-09 04:56:12 GMT LOG: invalid record length at
> 106/EA10B8C0
> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
> > 00000 2013-03-09 04:56:12 GMT LOG: streaming replication successfully
> > connected to primary
> > [unknown] 00000 2013-03-09 04:57:00 GMT [unknown]LOG: connection
> > received: host=[local]
> > [unknown] 00000 2013-03-09 04:57:03 GMT [unknown]LOG: connection
> > received: host=[local]
> >
> > [unknown] 10.155.253.43(51257) 00000 SELECT 2013-03-09 07:07:18 GMT
> prodLog:
> > duration: 6316.649 ms
> > [unknown] 10.155.253.43(51257) 00000 idle 2013-03-09 07:47:53 GMT
> prodLog:
> > disconnection: session time: 0:41:06.529 user=postgres database=fprod
> host=
> > 10.155.253.43 port=51257
> > 00000 2013-03-09 07:55:48 GMT LOG: restartpoint starting: time
> > 00000 2013-03-09 08:25:47 GMT LOG: restartpoint complete: wrote 19419
> > buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
> > 00000 2013-03-09 08:25:47 GMT LOG: recovery restart point at
> > 107/FB01B238
> > 00000 2013-03-09 08:25:47 GMT DETAIL: last completed transaction was
> at
> > log time 2013-03-09 08:25:41.85776+00
> > 00000 2013-03-09 08:55:48 GMT LOG: restartpoint starting: time
> > psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog: unexpected EOF
> on
> > client connection
> > psql [local] 00000 idle 2013-03-09 09:10:52 GMT prodLog: disconnection:
> > session time: 2:15:06.351 user=postgres database=fprod host=[local]
> > XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index
> page in
> > 100924/100937/1225845
> > XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node:
> > 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf
> F
> > isdelete
> > F updateBlkno:36483
> > 00000 2013-03-09 09:23:46 GMT LOG: startup process (PID 3880) exited
> > with exit code 1
> > 00000 2013-03-09 09:23:46 GMT LOG: terminating any other active
> server
> > processes
>
> The real problem is here:
> XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index
> page in 100924/100937/1225845
> XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item,
> node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata:
> F isleaf F isdelete F updateBlkno:36483
>
> That looks like something is badly broken (maybe data corruption)?
> Most of the google hits on that error are associated with GIST
> indexes. Are you using GIST indexes?
>
> Are you really (still) using 9.0.4 ? I hope you realize that there
> are known data corruption bugs in that version, and that version is
> super old at this point. You really need to update to 9.0.12.
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Lonni J Friedman 2013-03-10 00:11:32 Re: postgres 9.0.2 replicated database is crashing
Previous Message Lonni J Friedman 2013-03-09 22:09:42 Re: postgres 9.0.2 replicated database is crashing