From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [BUG?] WAL file archive leads to crash during startup
Date: 2024-08-28 08:18:13
Message-ID: CALdSSPibFUbhdHv7bMo9Px5-hKEd-ZS8PxtYdhFTBQPw0rM3aQ@mail.gmail.com
Lists: pgsql-hackers

Hi hackers!

Recently we faced an issue with WAL recovery in our production. It does
not reproduce frequently, roughly once every 1.5-2 months.

This is how the problem looks in our production cluster logs:

The PostgreSQL startup process fails with signal 6:
"startup process (PID 369306) was terminated by signal 6:
Aborted",,,,,,,,,"","postmaster",,0
"PANIC: invalid magic number 0000 in log segment
000000010000000000000000, offset 0"

=== Conditions to repro:

The problem occurs under the following conditions:

1) The PostgreSQL process exited abnormally, e.g. killed by the OOM
killer (so crash recovery is needed)
2) RedoStartLSN < CheckPointLoc, AND RedoStartLSN is in a different
WAL file than CheckPointLoc
3) Let's say RedoStartLSN is in file 000000010000000000000010. This
file is NOT in the pg_wal directory (it has already been archived and removed).

Point 3 is the tricky part, because I don't know whether this WAL
archival and removal is OK in the first place. Maybe removing the WAL
file that contains RedoStartLSN this early is the problem in itself.
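
To make condition 2 concrete: the segment number is just the LSN
divided by the WAL segment size, the same arithmetic the XLByteToSeg()
macro does. Here is a minimal standalone sketch (my own illustration,
not PostgreSQL code), hard-coding the LSNs from the pg_controldata
output shown below and assuming the default 16MB segment size:

```
/* check_condition2.c -- is RedoStartLSN in a different WAL segment
 * than CheckPointLoc?  Mirrors the XLByteToSeg() arithmetic. */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t seg_size = 16 * 1024 * 1024;		/* default 16MB segments */
	const uint64_t redo_start_lsn = 0x10E49720;		/* 0/10E49720 */
	const uint64_t checkpoint_loc = 0x11400268;		/* 0/11400268 */

	uint64_t	redo_seg = redo_start_lsn / seg_size;	/* -> 0x10 */
	uint64_t	ckpt_seg = checkpoint_loc / seg_size;	/* -> 0x11 */

	printf("RedoStartLSN segment:  %llX\n", (unsigned long long) redo_seg);
	printf("CheckPointLoc segment: %llX\n", (unsigned long long) ckpt_seg);
	printf("different WAL files:   %s\n", redo_seg != ckpt_seg ? "yes" : "no");
	return 0;
}
```

It prints segment 0x10 for RedoStartLSN and 0x11 for CheckPointLoc,
i.e. WAL files 000000010000000000000010 and 000000010000000000000011
on timeline 1.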

=== My repro:

This is how these conditions can be reproduced:

1) initdb, create table tt(i int, j int);
2) In two parallel sessions:
s1: insert into tt select * from generate_series(1,1000) a,
generate_series(1,1000)b;
s2: (wait a little and) CHECKPOINT;
3) killall -9 postgres
4) Check whether RedoStartLSN and CheckPointLoc are in different WAL
files. If so, take the WAL file containing RedoStartLSN and move it
somewhere outside of pg_wal.

After doing this, here is the pg_controldata output:

```
reshke(at)ygp-jammy:~/postgres$ /usr/local/pgsql/bin/pg_controldata ./db/
pg_control version number: 1300
Catalog version number: 202209061
Database system identifier: 7408076969759520641
Database cluster state: in crash recovery
pg_control last modified: Wed Aug 28 07:53:27 2024
Latest checkpoint location: 0/11400268
Latest checkpoint's REDO location: 0/10E49720
Latest checkpoint's REDO WAL file: 000000010000000000000010
Latest checkpoint's TimeLineID: 1
```
Now try to start PostgreSQL:

```
reshke(at)ygp-jammy:~/postgres$ /usr/local/pgsql/bin/postgres --single -D ./db/
2024-08-28 08:01:44.209 UTC [586161] DEBUG: invoking
IpcMemoryCreate(size=149471232)
2024-08-28 08:01:44.209 UTC [586161] DEBUG: mmap(150994944) with
MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2024-08-28 08:01:44.209 UTC [586161] DEBUG: cleaning up dynamic
shared memory control segment with ID 3875438492
2024-08-28 08:01:44.217 UTC [586161] DEBUG: dynamic shared memory
system will support 674 segments
2024-08-28 08:01:44.217 UTC [586161] DEBUG: created dynamic shared
memory control segment 1123977252 (26976 bytes)
2024-08-28 08:01:44.217 UTC [586161] DEBUG: InitPostgres
2024-08-28 08:01:44.217 UTC [586161] DEBUG: my backend ID is 1
2024-08-28 08:01:44.217 UTC [586161] LOG: database system was
interrupted while in recovery at 2024-08-28 07:53:27 UTC
2024-08-28 08:01:44.217 UTC [586161] HINT: This probably means that
some data is corrupted and you will have to use the last backup for
recovery.
2024-08-28 08:01:44.217 UTC [586161] DEBUG: removing all temporary WAL segments
2024-08-28 08:01:44.226 UTC [586161] DEBUG: checkpoint record is at 0/11400268
2024-08-28 08:01:44.226 UTC [586161] DEBUG: redo record is at
0/10E49720; shutdown false
2024-08-28 08:01:44.226 UTC [586161] DEBUG: next transaction ID: 744;
next OID: 16392
2024-08-28 08:01:44.226 UTC [586161] DEBUG: next MultiXactId: 1; next
MultiXactOffset: 0
2024-08-28 08:01:44.226 UTC [586161] DEBUG: oldest unfrozen
transaction ID: 716, in database 1
2024-08-28 08:01:44.226 UTC [586161] DEBUG: oldest MultiXactId: 1, in
database 1
2024-08-28 08:01:44.226 UTC [586161] DEBUG: commit timestamp Xid
oldest/newest: 0/0
2024-08-28 08:01:44.226 UTC [586161] LOG: database system was not
properly shut down; automatic recovery in progress
2024-08-28 08:01:44.226 UTC [586161] DEBUG: transaction ID wrap limit
is 2147484363, limited by database with OID 1
2024-08-28 08:01:44.226 UTC [586161] DEBUG: MultiXactId wrap limit is
2147483648, limited by database with OID 1
2024-08-28 08:01:44.226 UTC [586161] DEBUG: starting up replication slots
2024-08-28 08:01:44.227 UTC [586161] DEBUG: xmin required by slots:
data 0, catalog 0
2024-08-28 08:01:44.227 UTC [586161] DEBUG: starting up replication
origin progress state
2024-08-28 08:01:44.227 UTC [586161] DEBUG: didn't need to unlink
permanent stats file "pg_stat/pgstat.stat" - didn't exist
2024-08-28 08:01:44.231 UTC [586161] DEBUG: resetting unlogged
relations: cleanup 1 init 0
2024-08-28 08:01:44.232 UTC [586161] DEBUG: could not open file
"pg_wal/000000010000000000000010": No such file or directory
2024-08-28 08:01:44.232 UTC [586161] LOG: redo is not required
2024-08-28 08:01:44.232 UTC [586161] PANIC: invalid magic number 0000
in log segment 000000010000000000000000, offset 0
Aborted (core dumped)
```

This is obviously very bad, especially the `redo is not required`
part: recovery is in fact simply skipped even though it is definitely
needed.

Here is the same run with an assert-enabled build:
```
reshke(at)ygp-jammy:~/postgres$ /usr/local/pgsql/bin/postgres --single -D ./db/
2024-08-28 07:33:30.119 UTC [572905] DEBUG: invoking
IpcMemoryCreate(size=149471232)
2024-08-28 07:33:30.119 UTC [572905] DEBUG: mmap(150994944) with
MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2024-08-28 07:33:30.119 UTC [572905] DEBUG: cleaning up dynamic
shared memory control segment with ID 1601540488
2024-08-28 07:33:30.151 UTC [572905] DEBUG: dynamic shared memory
system will support 674 segments
2024-08-28 07:33:30.152 UTC [572905] DEBUG: created dynamic shared
memory control segment 773415688 (26976 bytes)
2024-08-28 07:33:30.152 UTC [572905] DEBUG: InitPostgres
2024-08-28 07:33:30.152 UTC [572905] DEBUG: my backend ID is 1
2024-08-28 07:33:30.152 UTC [572905] LOG: database system was
interrupted while in recovery at 2024-08-28 07:31:48 UTC
2024-08-28 07:33:30.152 UTC [572905] HINT: This probably means that
some data is corrupted and you will have to use the last backup for
recovery.
2024-08-28 07:33:30.152 UTC [572905] DEBUG: removing all temporary WAL segments
2024-08-28 07:33:30.170 UTC [572905] DEBUG: checkpoint record is at 0/11400268
2024-08-28 07:33:30.170 UTC [572905] DEBUG: redo record is at
0/10E49720; shutdown false
2024-08-28 07:33:30.170 UTC [572905] DEBUG: next transaction ID: 744;
next OID: 16392
2024-08-28 07:33:30.170 UTC [572905] DEBUG: next MultiXactId: 1; next
MultiXactOffset: 0
2024-08-28 07:33:30.170 UTC [572905] DEBUG: oldest unfrozen
transaction ID: 716, in database 1
2024-08-28 07:33:30.170 UTC [572905] DEBUG: oldest MultiXactId: 1, in
database 1
2024-08-28 07:33:30.170 UTC [572905] DEBUG: commit timestamp Xid
oldest/newest: 0/0
2024-08-28 07:33:30.170 UTC [572905] LOG: database system was not
properly shut down; automatic recovery in progress
2024-08-28 07:33:30.170 UTC [572905] DEBUG: transaction ID wrap limit
is 2147484363, limited by database with OID 1
2024-08-28 07:33:30.170 UTC [572905] DEBUG: MultiXactId wrap limit is
2147483648, limited by database with OID 1
2024-08-28 07:33:30.170 UTC [572905] DEBUG: starting up replication slots
2024-08-28 07:33:30.170 UTC [572905] DEBUG: xmin required by slots:
data 0, catalog 0
2024-08-28 07:33:30.170 UTC [572905] DEBUG: starting up replication
origin progress state
2024-08-28 07:33:30.170 UTC [572905] DEBUG: didn't need to unlink
permanent stats file "pg_stat/pgstat.stat" - didn't exist
2024-08-28 07:33:30.195 UTC [572905] DEBUG: resetting unlogged
relations: cleanup 1 init 0
2024-08-28 07:33:30.195 UTC [572905] DEBUG: could not open file
"pg_wal/000000010000000000000010": No such file or directory
2024-08-28 07:33:30.195 UTC [572905] LOG: redo is not required
TRAP: FailedAssertion("!XLogRecPtrIsInvalid(RecPtr)", File:
"xlogreader.c", Line: 235, PID: 572905)
/usr/local/pgsql/bin/postgres(ExceptionalCondition+0x99)[0x562a1a4311f9]
/usr/local/pgsql/bin/postgres(+0x229d41)[0x562a1a05dd41]
/usr/local/pgsql/bin/postgres(FinishWalRecovery+0x79)[0x562a1a062519]
/usr/local/pgsql/bin/postgres(StartupXLOG+0x324)[0x562a1a0578e4]
/usr/local/pgsql/bin/postgres(InitPostgres+0x72a)[0x562a1a44344a]
/usr/local/pgsql/bin/postgres(PostgresMain+0xb1)[0x562a1a300261]
/usr/local/pgsql/bin/postgres(PostgresSingleUserMain+0xf1)[0x562a1a3024d1]
/usr/local/pgsql/bin/postgres(main+0x4f1)[0x562a19f918c1]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f12dc110d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f12dc110e40]
/usr/local/pgsql/bin/postgres(_start+0x25)[0x562a19f91925]
Aborted (core dumped)
```

The problem tracks down to commit
70e81861fadd9112fa2d425c762e163910a4ee52.
We have only observed this problem on PostgreSQL 15 (15.6).

=== My understanding

So, the xlogrecovery.c internals wrongly assume that if no WAL record
was successfully fetched, then redo is not needed.
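
To illustrate why that assumption is unsafe, here is a tiny standalone
model of the decision as I read it (my own paraphrase, not actual
xlogrecovery.c code): whether a record could be fetched is used as the
only signal, while the relation between RedoStartLSN and CheckPointLoc,
which by itself proves that replay is needed, is ignored.

```
/* Toy model of the suspected decision flaw; not PostgreSQL source. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Simplified version of the "redo is not required" decision: it looks
 * only at whether a record could be fetched. */
static bool
redo_considered_required(bool record_fetched)
{
	return record_fetched;
}

int
main(void)
{
	XLogRecPtr	redo_start = 0x10E49720;	/* 0/10E49720 */
	XLogRecPtr	checkpoint_loc = 0x11400268;	/* 0/11400268 */

	/* The segment holding RedoStartLSN was archived away, so the read
	 * of the redo record fails and no record is fetched ... */
	bool		record_fetched = false;

	/* ... yet redo_start < checkpoint_loc alone proves there is WAL
	 * between the redo point and the checkpoint that must be replayed
	 * (or recovery must fail), so skipping redo silently is wrong. */
	printf("redo considered required: %s\n",
		   redo_considered_required(record_fetched) ? "yes" : "no");
	printf("redo actually required:   %s\n",
		   redo_start < checkpoint_loc ? "yes" : "no");
	return 0;
}
```

This matches the log and stack trace above: after "redo is not
required", FinishWalRecovery() ends up reading from an invalid
position, which trips Assert(!XLogRecPtrIsInvalid(RecPtr)) on
assert-enabled builds and the "invalid magic number ... offset 0"
PANIC otherwise.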

=== Proposed fix

It is as simple as the attached patch. It works for me, but this is
probably not the correct way to fix this.
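
For the sake of discussion, this is roughly the shape I had in mind (a
sketch only, not necessarily what the attached v1-0001-Fix.patch does;
the exact placement and variable names are assumptions): if no record
can be read back but the checkpoint says replay has to start at an
earlier RedoStartLSN, fail loudly instead of logging "redo is not
required" and moving on.

```
/* Sketch only -- not the attached patch.  The placement inside the
 * recovery code and the exact variables are assumptions. */
if (record == NULL && RedoStartLSN < CheckPointLoc)
	ereport(FATAL,
			(errmsg("could not find redo location %X/%X referenced by checkpoint record at %X/%X",
					LSN_FORMAT_ARGS(RedoStartLSN),
					LSN_FORMAT_ARGS(CheckPointLoc)),
			 errhint("The WAL segment containing the redo record may have been archived and removed; restore it before starting the server.")));
```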

--
Best regards,
Kirill Reshke

Attachment Content-Type Size
v1-0001-Fix.patch application/octet-stream 937 bytes
