From: akp geek <akpgeek(at)gmail(dot)com>
To: Lonni J Friedman <netllama(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: postgres 9.0.2 replicated database is crashing
Date: 2013-03-10 01:20:40
Message-ID: CACnhOfJCDGaOWMjULqrk1CqgEAzQZ9Yz0J4W6ge_ySUs6xNmeQ@mail.gmail.com
Lists: pgsql-general
Thanks a lot. One final question on this.

Since streaming failed in our case, we missed 4 days' worth of data. We
archive the WAL files on the primary. If we were to restart, what should
the starting point be?

Do we need to copy all the archived WAL files since that date (in our case
03/05) into the standby's pg_xlog? This is where I got confused.
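For reference, a minimal sketch of the standby's recovery.conf for this kind of
catch-up. The archive path is the one from the logs quoted below; the connection
details are placeholders for our actual primary:

    # recovery.conf on the standby (9.0) -- connection details are placeholders
    standby_mode = 'on'
    primary_conninfo = 'host=PRIMARY_HOST port=5432 user=REPL_USER'
    # Let the standby pull whatever it is missing from the primary's WAL archive,
    # rather than copying segments into pg_xlog by hand
    restore_command = 'cp /backup/9.0.4/archive/%f "%p"'

With a restore_command in place, the standby first replays every archived segment
it is missing (back to 03/05 in our case) and only then switches to streaming.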
Again, thanks a lot. Now we will definitely upgrade.
Regards
On Sat, Mar 9, 2013 at 7:11 PM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:
> On Sat, Mar 9, 2013 at 4:05 PM, akp geek <akpgeek(at)gmail(dot)com> wrote:
> > Appreciate your findings. Taking your points and doing things now. We
> > can upgrade. Which version is more stable, 9.2 or 9.1?
>
> They're two entirely different branches. They should both be equally
> stable. However, if you're looking for the most straightforward path
> I'd recommend going to 9.0.12. Also be sure to read the release notes
> first.
>
> >
> > We use GIST indexes quite a bit, and we use GIS as well.
> >
> > I recently compiled Postgres 9.2.
> >
> > Regards
> >
> >
> >
> > On Sat, Mar 9, 2013 at 5:09 PM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:
> >>
> >> On Sat, Mar 9, 2013 at 1:51 PM, akp geek <akpgeek(at)gmail(dot)com> wrote:
> >> > Thank you. As you mentioned, I understood that I am starting the
> >> > streaming from scratch, which is not what I wanted to do.
> >> >
> >> > Here is what I am planning to do (see the sketch below):
> >> >
> >> > Our replication process has been down since March 5th.
> >> >
> >> > 1. Is it OK to copy all WALs from March 5th till now into the standby's pg_xlog?
> >> > 2. Take pg_basebackup.
> >> > 3. Export the data dir.
> >> > 4. Stop the backup.
> >> > 5. Restart the standby.
> >> >
> >> > Based on my understanding it should work, because when the standby
> >> > started, it was not able to find the files, as I had deleted them
> >> > from the slave when I tried to fix the issue.
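For what it's worth, a minimal sketch of the base-backup steps in the plan above,
using the 9.0-style pg_start_backup()/pg_stop_backup() calls (pg_basebackup itself
only ships with 9.1 and later); host names and paths are placeholders:

    # On the primary: mark the start of the base backup
    psql -U postgres -c "SELECT pg_start_backup('standby_rebuild');"

    # Copy the data directory to the standby; pg_xlog can be skipped because the
    # standby refills it from the archive and from streaming (paths are placeholders)
    rsync -a --exclude=pg_xlog/ /path/to/data/ standby:/path/to/data/

    # On the primary: mark the end of the base backup
    psql -U postgres -c "SELECT pg_stop_backup();"

    # On the standby: put recovery.conf in the data directory, then start it
    pg_ctl -D /path/to/data start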
> >>
> >> Clearly it didn't work because you've been having issues for the past
> >> 4 days, and you've already tried this approach unsuccessfully. Based
> >> on the log snippets below, it looks like you have multiple problems.
> >> First your slave is definitely missing WAL segments. You should
> >> increase the number of WAL segments that are archived on the master
> >> and ensure that your base backup is including the WAL segments (by
> >> generating it with the -x option). However, that's the least of your
> >> problems at the moment, because it looks like the master is what is
> >> crashing.
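The advice above about keeping more WAL around and including WAL in the base
backup points at these primary-side settings; the values are illustrative rather
than tuned recommendations, and the archive path is the one from the logs:

    # postgresql.conf on the primary -- values are illustrative only
    wal_level = hot_standby      # 'archive' is enough if the standby need not serve reads
    wal_keep_segments = 512      # keep more segments so a lagging standby can catch up
    archive_mode = on
    archive_command = 'cp %p /backup/9.0.4/archive/%f'

(pg_basebackup's -x option, which bundles the needed WAL into the backup, exists
from 9.1 on; on 9.0 the archive plus a restore_command fills that role.)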
> >>
> >> >
> >> >
> >> > 00000 2013-03-09 04:56:08 GMT LOG: entering standby mode
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E1
> >> > 00000 2013-03-09 04:56:08 GMT LOG: redo starts at 106/E1000020
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E2
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E3
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E4
> >> > 00000 2013-03-09 04:56:10 GMT LOG: consistent recovery state reached at 106/E45AD4A8
> >> > 00000 2013-03-09 04:56:10 GMT LOG: database system is ready to accept read only connections
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E5
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E6
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E7
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E8
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000E9
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
> >> > 00000 2013-03-09 04:56:12 GMT LOG: invalid record length at 106/EA10B8C0
> >> > cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
> >> > 00000 2013-03-09 04:56:12 GMT LOG: streaming replication successfully connected to primary
> >> > [unknown] 00000 2013-03-09 04:57:00 GMT [unknown]LOG: connection received: host=[local]
> >> > [unknown] 00000 2013-03-09 04:57:03 GMT [unknown]LOG: connection received: host=[local]
> >> >
> >> > [unknown] 10.155.253.43(51257) 00000 SELECT 2013-03-09 07:07:18 GMT prodLog: duration: 6316.649 ms
> >> > [unknown] 10.155.253.43(51257) 00000 idle 2013-03-09 07:47:53 GMT prodLog: disconnection: session time: 0:41:06.529 user=postgres database=fprod host=10.155.253.43 port=51257
> >> > 00000 2013-03-09 07:55:48 GMT LOG: restartpoint starting: time
> >> > 00000 2013-03-09 08:25:47 GMT LOG: restartpoint complete: wrote 19419 buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
> >> > 00000 2013-03-09 08:25:47 GMT LOG: recovery restart point at 107/FB01B238
> >> > 00000 2013-03-09 08:25:47 GMT DETAIL: last completed transaction was at log time 2013-03-09 08:25:41.85776+00
> >> > 00000 2013-03-09 08:55:48 GMT LOG: restartpoint starting: time
> >> > psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog: unexpected EOF on client connection
> >> > psql [local] 00000 idle 2013-03-09 09:10:52 GMT prodLog: disconnection: session time: 2:15:06.351 user=postgres database=fprod host=[local]
> >> > XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index page in 100924/100937/1225845
> >> > XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf F isdelete F updateBlkno:36483
> >> > 00000 2013-03-09 09:23:46 GMT LOG: startup process (PID 3880) exited with exit code 1
> >> > 00000 2013-03-09 09:23:46 GMT LOG: terminating any other active server processes
> >>
> >> The real problem is here:
> >> XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index page in 100924/100937/1225845
> >> XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf F isdelete F updateBlkno:36483
> >>
> >> That looks like something is badly broken (maybe data corruption)?
> >> Most of the google hits on that error are associated with GIST
> >> indexes. Are you using GIST indexes?
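If GiST is in play, a quick way to see which relation the error points at is to
look up the last number of that node triple (assuming it is the relfilenode) in
the database named in the logs:

    # fprod is the database name from the logs; 1225845 is the relfilenode from the error
    psql -d fprod -c "SELECT n.nspname, c.relname, c.relkind
                        FROM pg_class c
                        JOIN pg_namespace n ON n.oid = c.relnamespace
                       WHERE c.relfilenode = 1225845;"

If it does turn out to be a GiST index, the usual suggestion is to REINDEX it on
the primary and then take a fresh base backup for the standby.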
> >>
> >> Are you really (still) using 9.0.4 ? I hope you realize that there
> >> are known data corruption bugs in that version, and that version is
> >> super old at this point. You really need to update to 9.0.12.
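A minor-version update within the 9.0 branch is just a binary swap (no dump/restore
needed); a minimal sketch, where the install step and paths depend on the platform:

    # Stop the server, install the 9.0.12 binaries over the old ones
    # (e.g. via the distro's package manager), then start it again
    pg_ctl -D /path/to/data stop -m fast
    # ... install/upgrade the PostgreSQL 9.0.12 packages here ...
    pg_ctl -D /path/to/data start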
> >
> >
>
>
>
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> L. Friedman netllama(at)gmail(dot)com
> LlamaLand https://netllama.linux-sxs.org
>