Re: pg_upgrade and frozen xids

From: bricklen <bricklen(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Natalie Wenz <nataliewenz(at)ebureau(dot)com>, pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: pg_upgrade and frozen xids
Date: 2018-03-07 20:21:23
Message-ID: CAGrpgQ9apRxeCng82nd0qwD7bKtNPebT8XtTcC0NxddBgcUnNA@mail.gmail.com
Lists: pgsql-admin

On Wed, Mar 7, 2018 at 12:01 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> I happen to know that bricklen already ran amcheck. There were errors,
> but they were not consistent with a collation issue. Rather, it looked
> like something was up with the storage layer -- the sibling links of a
> pair of pages were not in mutual agreement.
>
> Even if that wasn't something that I knew already, I still would not
> suspect opclass misbehavior of any variety. VACUUM doesn't care about
> the ordering of items on the page in the case of nbtree. And, it
> performs a physical order scan there (albeit with some extra trickery
> to prevent races due to concurrent splits). Index tuples that could
> end up being unreachable to index scans due to opclass misbehavior
> should remain reachable to VACUUM.
>

What little detail I've been able to collect so far is below. All of it is
from 10.1 clusters.

These are from the Postgres logs of 6 different databases (across 3 geo
regions; two of the databases were on the same hypervisor). Each error was
discovered when autovacuum tried to vacuum the database in question:

ERROR: could not find left sibling of block 4775 in index "<some index>"
ERROR: right sibling 13983 of block 7196 is not next child 7246 of block 5208 in index "<some index>"
ERROR: right sibling 60252 of block 60115 is not next child 60118 of block 60113 in index "<some index>"
ERROR: right sibling 93058 of block 93057 is not next child 93061 of block 93008 in index "<some index>"
ERROR: right sibling 10081 of block 10079 is not next child 10084 of block 10046 in index "<some index>"
ERROR: left link changed unexpectedly in block 13868 of index "<some index>"
ERROR: right sibling 145 of block 92 is not next child 93 of block 3 in index "<some index>"
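
For anyone following along who hasn't looked at what "sibling links not in
mutual agreement" means, here is a rough toy model in Python. It is only a
sketch of the invariant (a page's right sibling should point back at it with
its left link), not PostgreSQL's actual code; the Page class, the
find_sibling_mismatches function, and the block numbers are all made up for
illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Page:
    """Toy stand-in for a B-tree page: only the two sibling pointers."""
    left: Optional[int]   # block number of the left sibling, None if leftmost
    right: Optional[int]  # block number of the right sibling, None if rightmost


def find_sibling_mismatches(pages: Dict[int, Page]) -> List[Tuple[int, int]]:
    """Return (block, right_sibling) pairs whose links do not agree."""
    mismatches = []
    for blkno, page in sorted(pages.items()):
        if page.right is None:
            continue
        sibling = pages.get(page.right)
        # The right sibling must exist and must point back at this block.
        if sibling is None or sibling.left != blkno:
            mismatches.append((blkno, page.right))
    return mismatches


# Hypothetical block numbers, chosen only for the example.
healthy = {1: Page(None, 2), 2: Page(1, 3), 3: Page(2, None)}
# Block 2's right link skips to block 4, but block 4 still points back at 3.
damaged = {
    1: Page(None, 2),
    2: Page(1, 4),
    3: Page(2, 4),
    4: Page(3, None),
}

print(find_sibling_mismatches(healthy))  # []
print(find_sibling_mismatches(damaged))  # [(2, 4)]
```

In the damaged example only block 2 is reported, because block 3 and block 4
still agree with each other.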

A strace from the hung autovac process (before we killed it):

futex(0x7f07b8f575f8, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f07b8f575f8, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f07b8f575f8, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
...
