12.3 replicas falling over during WAL redo

From: Ben Chobot <bench(at)silentmedia(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: 12.3 replicas falling over during WAL redo
Date: 2020-08-01 14:55:27
Message-ID: 8a8d6728-22ad-61ec-09f9-51f29b9a2f61@silentmedia.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

We have a few hundred postgres servers in AWS EC2, all of which do
streaming replication to at least two replicas. As we've transitioned
our fleet to from 9.5 to 12.3, we've noticed an alarming increase in the
frequency of a streaming replica dying during replay. Postgres will log
something like:

|2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1]
db=,user= LOG: restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1]
db=,user= FATAL: incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2]
db=,user= CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM:
lastBlockVacuumed 1720 2020-07-31T16:55:24.642877+00:00 hostA
postgres[24074]: [8-1] db=,user= LOG: startup process (PID 24076) exited
with exit code 1|

...or:

|2020-07-31T10:43:35.520066+00:00 hostB postgres[11836]: [12-1] db=,user=
WARNING: will not overwrite a used ItemId
2020-07-31T10:43:35.520144+00:00 hostB postgres[11836]: [12-2] db=,user=
CONTEXT: WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax
128308492 flags 0x10 ; new off 2 xmax 0 2020-07-31T10:43:35.520180+00:00
hostB postgres[11836]: [13-1] db=,user= PANIC: failed to add tuple
2020-07-31T10:43:35.520220+00:00 hostB postgres[11836]: [13-2] db=,user=
CONTEXT: WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax
128308492 flags 0x10 ; new off 2 xmax 0 |

...or:
||

2020-07-30T14:59:36.839243+00:00 hostC postgres[24338]: [45253-1] db=,user= WARNING: specified item offset is too large
2020-07-30T14:59:36.839307+00:00 hostC postgres[24338]: [45253-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 hostC postgres[24338]: [45254-1] db=,user= PANIC: btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 hostC postgres[24338]: [45254-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 hostC postgres[24337]: [11-1] db=,user= LOG: startup process (PID 24338) was terminated by signal 6: Aborted
||

Each time, a simple restart of the postgres service will bring the
database back to a happy state and it will merrily replicate past the
LSN where it had died before. This has never (yet) happened on a primary
db, and (so far) always on only one one of the replicas the primary is
replicating to, leading me to think there isn't anything actually wrong
with my data. Still, this is no way to run a database. We never saw this
problem in 9.5, but it happened 3 times just on Friday. We have taken
the opportunity to enable checksuming with our move to 12, but this
doesn't appear to be related to that, to my untrained eyes.

A lot of suggestions I've heard to fix problems which sound like this
involve reindexing, and while we haven't yet tried that, it doesn't seem
likely to help. We upgraded from 9.5 to 12 using pglogical, so all our
indices were created from 12.3 code. That said, we were running
pg_repack in an automated fashion for a bit, which seemed to be causing
issues in 12 that we haven't had time to track down and so have
currently disabled.

With the non-durability of the problem, I'm out of ideas of what to look
for. Suggestions?

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2020-08-01 16:35:51 Re: 12.3 replicas falling over during WAL redo
Previous Message Zack Weinberg 2020-07-31 15:07:39 Apparent missed query optimization with self-join and inner grouping