Re: Critical failure of standby

From: James Sewell <james(dot)sewell(at)jirotech(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Critical failure of standby
Date: 2016-08-15 03:38:19
Message-ID: CAANVwEshmYmhqLW=3zS09maU1Yt36w+LsFAB6Jy8SvzVQUR3Pg@mail.gmail.com
Lists: pgsql-general

Hello All,

The thing I find a little worrying is that this 'corruption' was apparently
introduced on the network from PROD -> DR, but then also cascaded to both
other DR servers (either via replication or via archive_command).

Is WAL corruption checked for in any way on standby servers?

Here is a link to a diagram of the current environment:
http://imgur.com/a/MoKMo

I'll look into patching the source to get a core dump.

Cheers,

James Sewell,
Solutions Architect

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
P (+61) 2 8099 9000  W www.jirotech.com  F (+61) 2 8099 9099

On Sat, Aug 13, 2016 at 5:20 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
wrote:

> James Sewell wrote:
>
> > 2016-08-12 04:43:53 GMT [23614]: [5-1] user=,db=,client= (0:00000)LOG: consistent recovery state reached at 3/8811DFF0
> > 2016-08-12 04:43:53 GMT [23614]: [6-1] user=,db=,client= (0:XX000)FATAL: invalid memory alloc request size 3445219328
> > 2016-08-12 04:43:53 GMT [23612]: [3-1] user=,db=,client= (0:00000)LOG: database system is ready to accept read only connections
> > 2016-08-12 04:43:53 GMT [23612]: [4-1] user=,db=,client= (0:00000)LOG: startup process (PID 23614) exited with exit code 1
> > 2016-08-12 04:43:53 GMT [23612]: [5-1] user=,db=,client= (0:00000)LOG: terminating any other active server processes
> > 2016-08-12 04:43:53 GMT [23612]: [6-1] user=,db=,client= (0:00000)LOG: archiver process (PID 23627) exited with exit code 1
>
> What version is this?
>
> Hm, so the startup process finds the consistent point (which signals
> postmaster so that line 23612/3 says "ready to accept read-only conns")
> and immediately dies because of the invalid memory alloc error. I
> suppose that error must be while trying to process some xlog record, but
> without an xlog address it's difficult to say anything. I suppose you
> could try to pg_xlogdump WAL starting at the last known good address
> 3/8811DFF0 but I wouldn't know what to look for.
>
> One strange thing is that xlog replay sets up an error context, so you
> would have had a line like "xlog redo HEAP" etc, but there's nothing
> here. So maybe the allocation is not exactly in xlog replay, but
> something different. We'd need to see a backtrace in order to see what.
> Since this occurs in the startup process, probably the easiest way is to
> patch the source to turn that error into PANIC, then re-run and examine
> the resulting core file.
>
> --
> Álvaro Herrera http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
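
For the pg_xlogdump suggestion quoted above, it helps to know which WAL
segment file contains the last known good address 3/8811DFF0. The following
is only a minimal sketch, not something from the thread: the helper name
wal_segment_name is hypothetical, and it assumes the default 16 MB WAL
segment size and timeline 1, neither of which is confirmed for this cluster.

```python
# Hypothetical helper (not part of PostgreSQL): map an LSN such as
# "3/8811DFF0" to the name of the WAL segment file that contains it.
# Assumption: the default 16 MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(lsn: str, timeline: int = 1) -> str:
    """Return the 24-hex-digit WAL file name for the given LSN.

    The timeline defaults to 1, which is an assumption; a promoted or
    restored cluster may be on a later timeline.
    """
    high_hex, low_hex = lsn.split("/")
    high = int(high_hex, 16)   # upper 32 bits of the LSN
    low = int(low_hex, 16)     # lower 32 bits of the LSN
    segment = low // WAL_SEGMENT_SIZE  # segment index within this "high" part
    return f"{timeline:08X}{high:08X}{segment:08X}"


if __name__ == "__main__":
    print(wal_segment_name("3/8811DFF0"))
```

Under those assumptions this prints 000000010000000300000088, which would be
the segment to start reading from.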

