Re: 12.3 replicas falling over during WAL redo

From:	Ben Chobot <bench(at)silentmedia(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject:	Re: 12.3 replicas falling over during WAL redo
Date:	2020-08-01 16:58:05
Message-ID:	a39efdb7-2dc7-4260-503d-5ddda4900822@silentmedia.com
Lists:	pgsql-general

Alvaro Herrera wrote on 8/1/20 9:35 AM:
> On 2020-Aug-01, Ben Chobot wrote:
>
>> We have a few hundred postgres servers in AWS EC2, all of which do streaming
>> replication to at least two replicas. As we've transitioned our fleet
>> from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a
>> streaming replica dying during replay. Postgres will log something like:
>>
>> 2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG: restartpoint starting: time
>> 2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
>> 2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
>> 2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG: startup process (PID 24076) exited with exit code 1
> I've never seen this one.
>
> Can you find out which index is being modified at those LSNs -- is it
> always the same index? Can you have a look at nearby WAL records that
> touch the same page of the same index in each case?
>
> One possibility is that the storage forgot a previous write.

I'd be happy to, if you tell me how. :)

We're using XFS for our Postgres filesystem, on Ubuntu Bionic. Of course
it's always possible there's something wrong in the filesystem or the
EBS layer, but that is one thing we have not changed in the migration
from 9.5 to 12.3.
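
For reference, here is a rough sketch of one way to start on the question above. It is not something proposed in this thread. It pulls the LSN and resource manager out of the CONTEXT log line and builds a pg_waldump invocation for the failing record. The helper names and the WAL directory in the usage note are hypothetical; the pg_waldump options used (--path, --start, --rmgr, --limit) are its standard ones.

```python
import re

# Matches the CONTEXT line logged when redo fails, e.g.
# "CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720"
REDO_RE = re.compile(r"WAL redo at ([0-9A-F]+/[0-9A-F]+) for (\w+)/(\w+)")


def parse_redo_context(line):
    """Return the LSN, resource manager and record type, or None if absent."""
    m = REDO_RE.search(line)
    if m is None:
        return None
    lsn, rmgr, op = m.groups()
    return {"lsn": lsn, "rmgr": rmgr, "op": op}


def waldump_command(wal_dir, lsn, rmgr, limit=1):
    """Build (but do not run) a pg_waldump command starting at the given LSN.

    wal_dir is an assumption: point it at wherever the WAL segments live.
    """
    return [
        "pg_waldump",
        "--path", wal_dir,
        "--start", lsn,
        "--rmgr", rmgr,
        "--limit", str(limit),
    ]
```

Running the resulting command (for example with subprocess.run) against a directory such as /var/lib/postgresql/12/main/pg_wal should print the failing record, including a "rel tablespace/database/relfilenode blk N" reference. The relfilenode can then be mapped to an index name in the affected database, for instance with pg_filenode_relation(). Dropping --rmgr and raising --limit from an earlier start LSN is one way to look at nearby records touching the same block.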
