BUG #18640: Replica Sync Failure After Downtime in Patroni HA Setup Due to Missing WAL Segments

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: kotak(dot)nikhil(at)gmail(dot)com
Subject: BUG #18640: Replica Sync Failure After Downtime in Patroni HA Setup Due to Missing WAL Segments
Date: 2024-09-28 03:11:15
Message-ID: 18640-2a2df650791eab97@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 18640
Logged by: nikhil kotak
Email address: kotak(dot)nikhil(at)gmail(dot)com
PostgreSQL version: 14.10
Operating system: Red Hat Enterprise Linux 7.9
Description:

We run a Patroni HA setup with PostgreSQL in our environment, with two or
more replicas depending on the application tier. During regular maintenance
activities, such as OS patching or weekend server shutdowns, we stop the
Patroni service on the replicas while the original leader remains up.

These maintenance activities typically last around 2-3 hours. Once our UNIX
System Administrator returns the servers to operational status, we restart
the Patroni service on the replicas. However, the replicas frequently remain
out of sync, and the following error message appears in the logs:

"Could not receive data from WAL stream: ERROR: requested WAL segment <123>
has already been removed"

Upon investigation, we observe that the WAL segment no longer exists on the
replica (which was down), but it still exists on the leader. After manually
copying the missing WAL segment from the leader to the replica, the replica
successfully resumes syncing on its own.
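
To make the manual step concrete, the following is a minimal sketch of how
the segments to copy could be identified from two directory listings. It
assumes the default 16 MB WAL segment size and the standard 24-hex-digit
segment file naming (timeline, log, segment). The function names and the
sample file names are hypothetical and are not taken from our environment.

```python
import re

# With the default 16 MB segment size there are 256 segments per "log" number.
SEGMENTS_PER_LOG = 0x100000000 // (16 * 1024 * 1024)

_WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def parse_wal_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log, segment)."""
    if not _WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def missing_segments(requested, leader_files, replica_files):
    """List leader segments at or after `requested` that the replica lacks.

    Only segments on the same timeline as `requested` are considered.
    """
    timeline, log, seg = parse_wal_name(requested)
    start = log * SEGMENTS_PER_LOG + seg
    have = set(replica_files)
    result = []
    for name in sorted(leader_files):
        if not _WAL_NAME.match(name):
            continue  # skip .history files, .partial files, and so on
        t, l, s = parse_wal_name(name)
        if t == timeline and l * SEGMENTS_PER_LOG + s >= start and name not in have:
            result.append(name)
    return result


if __name__ == "__main__":
    # Hypothetical listings of pg_wal on the leader and on the replica.
    leader = [
        "0000000100000002000000FE",
        "0000000100000002000000FF",
        "000000010000000300000000",
        "00000002.history",
    ]
    replica = ["000000010000000300000000"]
    for name in missing_segments("0000000100000002000000FF", leader, replica):
        print(name)
```

The listings would come from the pg_wal directories on each host; the sketch
only compares names and does not copy anything.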

Issue: Once brought back online, the replica does not automatically fetch
the missing WAL segments from the primary. We have to intervene manually,
which adds unnecessary complexity and delay in restoring HA functionality
after downtime.

Expected Behavior: We expect the replica to automatically request and fetch
the missing WAL segments from the primary (leader) upon startup, ensuring it
can sync up without manual intervention.

Could you please help us understand why this behavior occurs, and whether it
can be addressed within PostgreSQL or Patroni to ensure automatic recovery
for the replicas?
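
For reference, here is a rough sketch of the arithmetic we would use to
judge whether WAL retention on the leader could span a 2-3 hour outage. It
assumes that the leader keeps WAL for a disconnected replica only through
the wal_keep_size setting (in megabytes), with no replication slot or WAL
archive involved. That assumption may not match our actual configuration,
and the WAL generation rate used below is a made-up figure.

```python
def wal_generated_mb(rate_mb_per_hour, downtime_hours):
    """Estimate how much WAL the leader writes while a replica is down."""
    return rate_mb_per_hour * downtime_hours


def retention_covers_downtime(wal_keep_size_mb, rate_mb_per_hour, downtime_hours):
    """True if wal_keep_size is at least the WAL written during the outage.

    Assumption: wal_keep_size is the only thing retaining WAL for the replica.
    """
    return wal_keep_size_mb >= wal_generated_mb(rate_mb_per_hour, downtime_hours)


if __name__ == "__main__":
    # Hypothetical numbers: 2 GB of WAL per hour, 3 hours of downtime.
    needed = wal_generated_mb(2048, 3)
    print(f"WAL generated during downtime: {needed} MB")
    print("covered by wal_keep_size=4096:", retention_covers_downtime(4096, 2048, 3))
    print("covered by wal_keep_size=8192:", retention_covers_downtime(8192, 2048, 3))
```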
