Is it possible for a WAL file to be missing records?

From: 와따가따 <lght2000(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Is it possible for a WAL file to be missing records?
Date: 2025-03-09 07:55:48
Message-ID: CAAEzU5vAfo0gD2u=TWAjFoeQLjHdTTaypM85mx+rQKHE8ht1OA@mail.gmail.com
Lists: pgsql-bugs

PostgreSQL version and HA extension in use
- PostgreSQL 13.10
- pg_auto_failover 2.0

CPU usage and system load were increasing because of a heavy workload.

A failover was performed while a large number of WALWrite events were
occurring on the primary DB.

I confirmed that the secondary's failure to be promoted was a
pg_auto_failover issue.

I promoted the secondary manually.

I originally tried to turn the old primary DB into a new secondary using the
archived WAL files, but a WAL record seemed to be missing.

So I opened the WAL file with pg_waldump and saw that a record was missing.

The DB server did not crash.
Is it possible for records not to be written to the WAL file when a
failover is performed under high load?

I'm wondering whether this should be considered a bug, or whether this was
a situation in which WAL records can be lost.

I will send you the information confirmed through DB log and pg_waldump.

I'll share some DB settings too.
hot_standby_feedback = on
hot_standby = on
synchronous_commit = on
wal_writer_flush_after = 1MB
wal_sync_method = fdatasync
wal_writer_delay = 200ms
wal_buffers = 16MB
wal_segment_size = 16MB

*[When the first failover occurs]*
*- WAL apply DB log*
[image: image.png]
*- Check the WAL records using pg_waldump*
I verified that there are no missing LSNs in 0000000300005015000000A6 and
0000000300005015000000A7.
However, the prev LSN shown in 0000000300005015000000A8 is not found in
0000000300005015000000A7.
- The last LSN of 0000000300005015000000A7 is 5015/A6003778.
- The prev LSN of the first record of 0000000300005015000000A8 is
5015/A7FFED78.
[image: image.png]
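
For reference, here is a minimal Python sketch of how an LSN maps to a WAL
segment file name, assuming the usual naming scheme (timeline, then the
segment number split into two 8-digit hex fields) and the 16MB
wal_segment_size listed above. The helper names parse_lsn and wal_file_name
are my own and are not part of any PostgreSQL tool.

```python
# Sketch only: maps an LSN to the WAL segment file expected to contain it.
# Assumes wal_segment_size = 16MB, as in the settings above.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '5015/A7FFED78' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline: int, lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the 24-character segment file name that covers the given LSN."""
    segno = parse_lsn(lsn) // seg_size
    segs_per_xlogid = 0x100000000 // seg_size  # 256 segments per 4GB when 16MB
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


if __name__ == "__main__":
    for lsn in ("5015/A6003778", "5015/A7FFED78"):
        print(lsn, "->", wal_file_name(3, lsn))
```

Under these assumptions, 5015/A7FFED78 falls in 0000000300005015000000A7,
while 5015/A6003778 would fall in 0000000300005015000000A6, which may be
worth checking against the pg_waldump output.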

*[When the second failover occurs]*
*- DB log*
[image: image.png]

- Check the WAL records using pg_waldump
The last LSN of 000000030000501E0000008E is 501E/8EFFCED8.
The prev LSN of the first record in the 000000030000501E0000008F WAL file is
501E/8EFFEEC8.
Because of the large difference between these LSNs, records appear to have
been lost.

[image: image.png]
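
To put a number on the difference for the second failover, here is a small
Python sketch that subtracts the two LSNs quoted above. It assumes the 16MB
segment size from the settings; parse_lsn and lsn_diff are hypothetical
helper names, not PostgreSQL functions.

```python
# Sketch only: byte distance between two LSNs, plus the distance from an LSN
# to the end of its 16MB segment (segment size assumed from the settings).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '501E/8EFFCED8' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(start: str, end: str) -> int:
    """Number of bytes between two LSNs (end minus start)."""
    return parse_lsn(end) - parse_lsn(start)


def bytes_to_segment_end(lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> int:
    """Bytes remaining from the LSN to the end of the segment that holds it."""
    return seg_size - (parse_lsn(lsn) % seg_size)


if __name__ == "__main__":
    last_seen = "501E/8EFFCED8"  # last LSN found in ...8E
    expected = "501E/8EFFEEC8"   # prev LSN of the first record in ...8F
    print("gap in bytes:", lsn_diff(last_seen, expected))
    print("bytes from prev LSN to segment end:", bytes_to_segment_end(expected))
```

With the values above, the two LSNs are 8176 bytes apart, and the prev LSN
sits 4408 bytes before the end of segment 8E.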
