Re: postgres 9.0.2 replicated database is crashing

From: Lonni J Friedman <netllama(at)gmail(dot)com>
To: akp geek <akpgeek(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: postgres 9.0.2 replicated database is crashing
Date: 2013-03-09 22:09:42
Message-ID: CAP=oouFx+tH-9v8o6WzHPz5rtF489OhdBUS0FH8fUfDJXDUnsA@mail.gmail.com
Lists: pgsql-general

On Sat, Mar 9, 2013 at 1:51 PM, akp geek <akpgeek(at)gmail(dot)com> wrote:
> thank you. As you mentioned, I understood that I am starting the streaming
> from scratch, which is not what I wanted to do.
>
> Here is what I am planning to do:
>
> Our replication process has been down since March 5th.
>
> 1. Is it OK to get all WALs from March 5th till now into the standby pg_xlog?
> 2. take pg_basebackup
> 3. export the data dir
> 4. stop backup
> 5. restart the standby.
>
> Based on my understanding it should work, because when the standby started,
> it was not able to find the files, as I had deleted them from the slave when
> I tried to fix the issue.

Clearly it didn't work, because you've been having issues for the past
4 days and you've already tried this approach unsuccessfully. Based
on the log snippets below, it looks like you have multiple problems.
First, your slave is definitely missing WAL segments. You should
increase the number of WAL segments that are archived on the master
and ensure that your base backup includes the WAL segments (by
generating it with the -x option). However, that's the least of your
problems at the moment, because it looks like the master is what is
crashing.
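
For the WAL side of it, something like this would be a starting point
(the values and paths are only illustrative, and this assumes the
pg_basebackup you mentioned supports the -x option):

    # postgresql.conf on the master: keep more WAL segments around so
    # the standby can still catch up after a short outage
    wal_keep_segments = 512

    # take the base backup with the needed WAL bundled in (-x)
    pg_basebackup -h <master-host> -U <replication-user> \
        -D /path/to/standby/data -x -P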

>
>
> 00000 2013-03-09 04:56:08 GMT LOG: entering standby mode
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000E1
> 00000 2013-03-09 04:56:08 GMT LOG: redo starts at 106/E1000020
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000E2
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000E3
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000E4
> 00000 2013-03-09 04:56:10 GMT LOG: consistent recovery state reached at
> 106/E45AD4A8
> 00000 2013-03-09 04:56:10 GMT LOG: database system is ready to accept
> read only connections
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000E5
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000E6
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000E7
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000E8
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000E9
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
> 00000 2013-03-09 04:56:12 GMT LOG: invalid record length at 106/EA10B8C0
> cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
> 00000 2013-03-09 04:56:12 GMT LOG: streaming replication successfully
> connected to primary
> [unknown] 00000 2013-03-09 04:57:00 GMT [unknown]LOG: connection
> received: host=[local]
> [unknown] 00000 2013-03-09 04:57:03 GMT [unknown]LOG: connection
> received: host=[local]
>
> [unknown] 10.155.253.43(51257) 00000 SELECT 2013-03-09 07:07:18 GMT prodLog:
> duration: 6316.649 ms
> [unknown] 10.155.253.43(51257) 00000 idle 2013-03-09 07:47:53 GMT prodLog:
> disconnection: session time: 0:41:06.529 user=postgres database=fprod host=
> 10.155.253.43 port=51257
> 00000 2013-03-09 07:55:48 GMT LOG: restartpoint starting: time
> 00000 2013-03-09 08:25:47 GMT LOG: restartpoint complete: wrote 19419
> buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
> 00000 2013-03-09 08:25:47 GMT LOG: recovery restart point at
> 107/FB01B238
> 00000 2013-03-09 08:25:47 GMT DETAIL: last completed transaction was at
> log time 2013-03-09 08:25:41.85776+00
> 00000 2013-03-09 08:55:48 GMT LOG: restartpoint starting: time
> psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog: unexpected EOF on
> client connection
> psql [local] 00000 idle 2013-03-09 09:10:52 GMT prodLog: disconnection:
> session time: 2:15:06.351 user=postgres database=fprod host=[local]
> XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index page in
> 100924/100937/1225845
> XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node:
> 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf F
> isdelete
> F updateBlkno:36483
> 00000 2013-03-09 09:23:46 GMT LOG: startup process (PID 3880) exited
> with exit code 1
> 00000 2013-03-09 09:23:46 GMT LOG: terminating any other active server
> processes

The real problem is here:
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index
page in 100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item,
node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata:
F isleaf F isdelete F updateBlkno:36483

That looks like something is badly broken (maybe data corruption)?
Most of the google hits on that error are associated with GIST
indexes. Are you using GIST indexes?
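
If you want to check, something like the following should answer that
(run it on the master; this assumes fprod from your log is the affected
database, and 1225845 is the relfilenode from the error above):

    # list any GIST indexes in the database
    psql -d fprod -c "
        SELECT n.nspname, c.relname
          FROM pg_class c
          JOIN pg_am a ON a.oid = c.relam
          JOIN pg_namespace n ON n.oid = c.relnamespace
         WHERE a.amname = 'gist' AND c.relkind = 'i';"

    # try to map the relfilenode from the error back to a relation
    psql -d fprod -c "
        SELECT relname, relkind FROM pg_class WHERE relfilenode = 1225845;"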

Are you really (still) using 9.0.4? I hope you realize that there
are known data corruption bugs in that version, and it is super old
at this point. You really need to update to 9.0.12.
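
A minor update is just a binary swap, no dump/reload needed; roughly
(the exact install step depends on how postgres is packaged on your
system, so treat this as a sketch):

    # confirm what the server is actually running
    psql -c "SELECT version();"

    # stop the server, install the 9.0.12 binaries over the old ones,
    # then start it again
    pg_ctl -D "$PGDATA" stop -m fast
    pg_ctl -D "$PGDATA" start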
