Re: Critical failure of standby

From: James Sewell <james(dot)sewell(at)jirotech(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Critical failure of standby
Date: 2016-08-12 21:54:48
Message-ID: CAANVwEuuQXrSOP4zVMpC5Kpoiv62oHyL8NKGhGwiq-NkzKdwVg@mail.gmail.com
Lists: pgsql-general

And a diagram of how it hangs together.

Cheers,

James Sewell,
PostgreSQL Team Lead / Solutions Architect

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
*P* (+61) 2 8099 9000 *W* www.jirotech.com *F* (+61) 2 8099 9099

On Sat, Aug 13, 2016 at 7:54 AM, James Sewell <james(dot)sewell(at)jirotech(dot)com>
wrote:

> (from other thread)
>
>
> - 9.5.3
> - Redhat 7.2 on VMWare
> - Single PostgreSQL instance on each machine
> - Every machine in DR became corrupt, so interestingly the corruption must
> have been sent to the two cascading nodes via WAL before the crash on the
> hub DR node
> - No OS logs indicating anything abnormal
>
> I think the key is the (legitimate) loss of network to the Prod master,
> followed by:
>
> (0:XX000)FATAL: invalid memory alloc request size 3445219328
>
> Everything seems to go wrong from there. Are WAL segments checked for
> integrity once they are received?
>
>
> On Sat, Aug 13, 2016 at 7:43 AM, James Sewell <james(dot)sewell(at)jirotech(dot)com>
> wrote:
>
>> It's on 9.5.3.
>>
>> I've actually created this mail twice (I sent once as an unregistered
>> address and assumed it would be dropped). I sent a diagram to the other
>> one, I'll forward that mail here now for completeness.
>>
>> Cheers,
>>
>>
>> On Sat, Aug 13, 2016 at 5:20 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>>
>>> James Sewell wrote:
>>>
>>> > 2016-08-12 04:43:53 GMT [23614]: [5-1] user=,db=,client= (0:00000)LOG: consistent recovery state reached at 3/8811DFF0
>>> > 2016-08-12 04:43:53 GMT [23614]: [6-1] user=,db=,client= (0:XX000)FATAL: invalid memory alloc request size 3445219328
>>> > 2016-08-12 04:43:53 GMT [23612]: [3-1] user=,db=,client= (0:00000)LOG: database system is ready to accept read only connections
>>> > 2016-08-12 04:43:53 GMT [23612]: [4-1] user=,db=,client= (0:00000)LOG: startup process (PID 23614) exited with exit code 1
>>> > 2016-08-12 04:43:53 GMT [23612]: [5-1] user=,db=,client= (0:00000)LOG: terminating any other active server processes
>>> > 2016-08-12 04:43:53 GMT [23612]: [6-1] user=,db=,client= (0:00000)LOG: archiver process (PID 23627) exited with exit code 1
>>>
>>> What version is this?
>>>
>>> Hm, so the startup process finds the consistent point (which signals
>>> postmaster so that line 23612/3 says "ready to accept read-only conns")
>>> and immediately dies because of the invalid memory alloc error. I
>>> suppose that error must be while trying to process some xlog record, but
>>> without an xlog address it's difficult to say anything. I suppose you
>>> could try to pg_xlogdump WAL starting at the last known good address
>>> 3/8811DFF0 but I wouldn't know what to look for.
>>>
>>> One strange thing is that xlog replay sets up an error context, so you
>>> would have had a line like "xlog redo HEAP" etc, but there's nothing
>>> here. So maybe the allocation is not exactly in xlog replay, but
>>> something different. We'd need to see a backtrace in order to see what.
>>> Since this occurs in the startup process, probably the easiest way is to
>>> patch the source to turn that error into PANIC, then re-run and examine
>>> the resulting core file.
>>>
>>> --
>>> Álvaro Herrera http://www.2ndQuadrant.com/
>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>>
>>
>>
>
>
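
For anyone wanting to follow the pg_xlogdump suggestion quoted above, here is a rough sketch of how the last known good address 3/8811DFF0 maps to a WAL segment file, plus a check of the failing request size against the allocator limit. It assumes the default 16 MB segment size and timeline 1, neither of which is stated in this thread, and the function names are made up for illustration. The limit is MaxAllocSize (0x3fffffff, i.e. 1 GB - 1) from PostgreSQL's memutils.h.

```python
# Sketch only. Assumptions (not confirmed in the thread): default 16 MB WAL
# segments and timeline 1. Function names are hypothetical helpers.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024                 # assumed default segment size
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE
MAX_ALLOC_SIZE = 0x3FFFFFFF                         # MaxAllocSize: 1 GB - 1


def parse_lsn(text):
    """Turn an 'X/Y' LSN string (two hex halves) into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_for_lsn(lsn_text, timeline=1):
    """Return (segment file name, byte offset inside that segment)."""
    lsn = parse_lsn(lsn_text)
    segno = lsn // WAL_SEGMENT_SIZE
    name = "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )
    return name, lsn % WAL_SEGMENT_SIZE


def exceeds_max_alloc(size):
    """True if a palloc-style request of this size would be rejected."""
    return size > MAX_ALLOC_SIZE


if __name__ == "__main__":
    name, offset = wal_file_for_lsn("3/8811DFF0")
    print(name, hex(offset))
    print(exceeds_max_alloc(3445219328))
```

Under those assumptions the segment to look at would be the one printed by the script, and pg_xlogdump can be started from that address with -s 3/8811DFF0 (with -p pointing at the pg_xlog directory). The request size in the FATAL line is well over the 1 GB limit, which would be consistent with a length field being read from bad data, though that is only a guess without a backtrace.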


Attachment Content-Type Size
diagram.png image/png 665.7 KB
