Re: recovery snapshot waiting for non-overflowed snapshot or until oldest active xid on standby is at least 4739126 (now 1422751)

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Li EF Zhang <bjzhangl(at)cn(dot)ibm(dot)com>
Cc: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: recovery snapshot waiting for non-overflowed snapshot or until oldest active xid on standby is at least 4739126 (now 1422751)
Date: 2021-01-02 11:50:06
Message-ID: CABUevExrp8sHsosA6NMqxZ4xqZPczDnfhYvQ8zzmv02C2-dcgg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

On Sat, Jan 2, 2021 at 12:38 PM Li EF Zhang <bjzhangl(at)cn(dot)ibm(dot)com> wrote:
>
> We have a postgresql database cluster with 3 node, one is primary, the other 2 are secondary. The cluster is managed by patroni. The 3 databases are in respective containers. When we update one secondary container image, the database in this container can not start. It reports "database system is starting up". We tried to delete the container and recover it from primary, the same error is reported.
>
> I checked the db log, there is a message:
> DEBUG: recovery snapshot waiting for non-overflowed snapshot or until oldest active xid on standby is at least 4739126 (now 1422751)#0122020-12-15 07:40:08.513 UTC [146-5182] CONTEXT: WAL redo at 3/ED11B020 for Standby/RUNNING_XACTS: nextXid 4739482 latestCompletedXid 4739475 oldestRunningXid 1422751; 16 xacts: 2716862 2721890 4665244 2495592 2289138 2288820 2287653 1422751 4280517 2288510 2287620 3297674 1757103 4739326 3320989 2259670; subxid ovf
>
>
> Seems snapshot overflowed which causes the secondary can not start up. I am newer to postgresql. I do not know very clearly how this happened and how to fix it. Thanks!

AFAICT it looks like you have a very old running transaction on the
system (transaction id 1422751 is 3 million transactions ago, and
there are other transactions also very old). As mentioned on
https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-ADMIN,
having very long running transactions in combinations with many
subtransactions can cause it to take a long time for the system to
reach a consistent state. See also the caveat list at the bottom of
that page.

If those transactions are such that they're needed, you have no other
choice but to wait I believe. But in (the more likely, I'd say) they
represent hung or otherwise broken clients, then terminating those
sessions on the primary should help the process along.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message VAE Ventures 2021-01-04 01:33:22 Postgres 12.4 inner join with where statement = 'string' returning error
Previous Message Li EF Zhang 2021-01-02 04:11:14 recovery snapshot waiting for non-overflowed snapshot or until oldest active xid on standby is at least 4739126 (now 1422751)