Apparent bug in WAL summarizer process (hung state)

From: Israel Barth Rubio <barthisrael(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Apparent bug in WAL summarizer process (hung state)
Date: 2024-06-24 17:56:00
Message-ID: CAO_rXXDaMnhkXXpJbS2erzaSx5nyPsYZUKfM0gfmy6UWjrfSMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

Hope you are doing well.

I've been playing a bit with the incremental backup feature which might
come as
part of the 17 release, and I think I hit a possible bug in the WAL
summarizer
process.

The issue that I face refers to the summarizer process getting into a hung
state.
When the issue is triggered, it keeps in an infinite loop trying to process
a WAL
file that no longer exists. It apparently comes up only when I perform
changes to
`wal_summarize` GUC and reload Postgres, while there is some load in
Postgres
which makes it recycle WAL files.

I'm running Postgres 17 in a Rockylinux 9 VM. In order to have less WAL
files
available in `pg_wal` and make it easier to reproduce the issue, I'm using
a low
value for `max_wal_size` ('100MB'). You can find below the steps that I
took to
reproduce this problem, assuming this small `max_wal_size`, and
`summarize_wal`
initially enabled:

```bash
# Assume we initially have max_wal_size = '100MB' and summarize_wal = on

# Create a table of ~ 100MB
psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_1

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental
full_backup_1/backup_manifest

# Disable summarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO off"
psql -c "SELECT pg_reload_conf()"

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"

# Re-enable sumarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO on"
psql -c "SELECT pg_reload_conf()"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_2

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental
full_backup_2/backup_manifest
```

I'm able to reproduce the issue most of the time when running these steps
manually. It's harder to reproduce if I attempt to run those commands as a
bash script.

This is the sample output of a run of those commands:

```console

(barman) [postgres(at)barmandevhost ~]$ psql -c "CREATE TABLE test AS
SELECT generate_series(1, 3000000)"SELECT 3000000(barman)
[postgres(at)barmandevhost ~]$ pg_basebackup -X none -c fast -P -D
full_backup_1NOTICE: WAL archiving is not enabled; you must ensure
that all required WAL segments are copied through other means to
complete the backup331785/331785 kB (100%), 1/1 tablespace(barman)
[postgres(at)barmandevhost ~]$ psql -c "DROP TABLE test"DROP
TABLE(barman) [postgres(at)barmandevhost ~]$ psql -c "CREATE TABLE test
AS SELECT generate_series(1, 3000000)"SELECT 3000000(barman)
[postgres(at)barmandevhost ~]$ pg_basebackup -X none -c fast -P -D
incremental_backup_1 --incremental
full_backup_1/backup_manifestNOTICE: WAL archiving is not enabled;
you must ensure that all required WAL segments are copied through
other means to complete the backup111263/331720 kB (33%), 1/1
tablespace(barman) [postgres(at)barmandevhost ~]$ psql -c "ALTER SYSTEM
SET summarize_wal TO off"ALTER SYSTEM(barman) [postgres(at)barmandevhost
~]$ psql -c "SELECT pg_reload_conf()" pg_reload_conf----------------
t(1 row)
(barman) [postgres(at)barmandevhost ~]$ psql -c "DROP TABLE test"DROP
TABLE(barman) [postgres(at)barmandevhost ~]$ psql -c "CREATE TABLE test
AS SELECT generate_series(1, 3000000)"SELECT 3000000(barman)
[postgres(at)barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO
on"ALTER SYSTEM(barman) [postgres(at)barmandevhost ~]$ psql -c "SELECT
pg_reload_conf()" pg_reload_conf---------------- t(1 row)
(barman) [postgres(at)barmandevhost ~]$ pg_basebackup -X none -c fast -P
-D full_backup_2NOTICE: WAL archiving is not enabled; you must ensure
that all required WAL segments are copied through other means to
complete the backup331734/331734 kB (100%), 1/1 tablespace(barman)
[postgres(at)barmandevhost ~]$ pg_basebackup -X none -c fast -P -D
incremental_backup_2 --incremental
full_backup_2/backup_manifestWARNING: still waiting for WAL
summarization through 2/C1000028 after 10 secondsDETAIL:
Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in
memory.WARNING: still waiting for WAL summarization through
2/C1000028 after 20 secondsDETAIL: Summarization has reached
2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: still waiting
for WAL summarization through 2/C1000028 after 30 secondsDETAIL:
Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in
memory.WARNING: still waiting for WAL summarization through
2/C1000028 after 40 secondsDETAIL: Summarization has reached
2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: still waiting
for WAL summarization through 2/C1000028 after 50 secondsDETAIL:
Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in
memory.WARNING: still waiting for WAL summarization through
2/C1000028 after 60 secondsDETAIL: Summarization has reached
2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: aborting backup
due to backend exiting before pg_backup_stop was calledpg_basebackup:
error: could not initiate base backup: ERROR: WAL summarization is
not progressingDETAIL: Summarization is needed through 2/C1000028,
but is stuck at 2/B30000D8 on disk and 2/B30000D8 in
memory.pg_basebackup: removing data directory "incremental_backup_2"

```

I took an `ls` output from `pg_wal` as well as `strace` and `gdb` from the
WAL
summarizer process. I'm attaching that to this email hoping that can help
somehow.

FWIW, once I restart Postgres the WAL summarizer process gets back to normal
functioning. It seems to me there is some race condition between when a WAL
file
is removed and when `summarize_wal` is re-enabled, causing the process to
keep
looking for a WAL file that is the past.

Best regards,
Israel.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Israel Barth Rubio 2024-06-24 17:59:18 Re: Apparent bug in WAL summarizer process (hung state)
Previous Message Peter Geoghegan 2024-06-24 17:54:10 Re: POC, WIP: OR-clause support for indexes