Postgresql 9.5: Streaming Replication: Secondaries Fail To Start Post WAL Error

From: Mohan NBSPS <mohan(dot)nbs(dot)ont(at)gmail(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Cc: Mohan NBSPS <mohan(dot)nbs(dot)ont(at)gmail(dot)com>
Subject: Postgresql 9.5: Streaming Replication: Secondaries Fail To Start Post WAL Error
Date: 2024-05-28 18:26:41
Message-ID: CAPCvfWcm0JDC+q54MSW7N90PYvh+PefaP6SxfonbkGcUwpS1+g@mail.gmail.com
Lists: pgsql-admin

Dear Community,

I am trying to understand why all the secondary databases failed to start
after logging a WAL-related error for some time.

Timeline:

2024-04-19: WAL errors appear in the secondary database nodes

```
LOG: invalid resource manager ID 55 at 40/F46CBCA8
```

- the secondaries showed no replication lag
- lag was monitored by querying
```
pg_last_xact_replay_timestamp
```

2024-05-02: Secondaries reboot and fail to start up

```
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000004100000049 has already been removed
FATAL: the database system is starting up
```
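
For reference, the two positions in the log lines above can be related with a little arithmetic. This is a minimal sketch, assuming the default 16 MB WAL segment size (it is a compile-time option, so a custom build could differ) and timeline 1 for the LSN lookup. The helper names `parse_segment_name`, `segment_start_lsn` and `segment_for_lsn` are hypothetical, not part of PostgreSQL.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def parse_segment_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log, seg)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segment_start_lsn(name):
    """Return the LSN (as 'X/Y' text) of the first byte in the segment."""
    _, log, seg = parse_segment_name(name)
    return "%X/%X" % (log, seg * WAL_SEGMENT_SIZE)


def segment_for_lsn(lsn, timeline=1):
    """Return the WAL file name that would contain the given 'X/Y' LSN."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    return "%08X%08X%08X" % (timeline, high, low // WAL_SEGMENT_SIZE)


if __name__ == "__main__":
    # Segment named in the FATAL message on 2024-05-02.
    print(segment_start_lsn("000000010000004100000049"))
    # LSN named in the "invalid resource manager ID" message on 2024-04-19.
    print(segment_for_lsn("40/F46CBCA8"))
```

Under those assumptions, the removed segment 000000010000004100000049 begins at 41/49000000, while 40/F46CBCA8 falls in segment 0000000100000040000000F4, which is 85 segments earlier.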

From my understanding, WAL is streamed over the network (the secondary
pulls from the primary) and written to a WAL file on the secondary.
A separate process then replays the received WAL.
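
To make the receive/replay split concrete: on 9.5 a standby reports the two positions separately through `pg_last_xlog_receive_location()` and `pg_last_xlog_replay_location()`, and the gap between them can be expressed in bytes. The following is a minimal sketch, assuming the two LSNs have already been fetched as text. The names `lsn_to_int` and `lsn_diff_bytes` and the sample values are hypothetical and for illustration only.

```python
def lsn_to_int(lsn):
    """Convert an 'X/Y' LSN string into an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff_bytes(ahead, behind):
    """Bytes by which `ahead` leads `behind` (negative if it trails)."""
    return lsn_to_int(ahead) - lsn_to_int(behind)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the affected cluster.
    received = "41/49000000"
    replayed = "41/48FFFF00"
    print(lsn_diff_bytes(received, replayed))
```

The same subtraction is available server-side as `pg_xlog_location_diff()`. A gap near zero only shows that replay keeps up with what has been received; it does not by itself show that new WAL is still arriving from the primary.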

For the local WAL on a secondary to go out of sync, I can think of these causes:

1. the primary removed a WAL file that the secondary was still streaming
2. the WAL file on the secondary got corrupted
3. ...

Questions

- What do those error messages mean?
- How can I prevent this from happening?

References

- https://www.postgresql.org/docs/9.5/wal-configuration.html

Any advice/information is highly appreciated.
thank you
mohan
