BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: g(dot)shelkovy(at)arenadata(dot)io
Subject: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
Date: 2024-08-07 12:19:42
Message-ID: 18575-1b9b3b60e2a480de@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 18575
Logged by: Georgy Silkly
Email address: g(dot)shelkovy(at)arenadata(dot)io
PostgreSQL version: 16.3
Operating system: docker alpine
Description:

Sometimes pg_rewind mistakenly assumes that nothing needs to be done, which
results in the replica having data that is not on the master.

Steps to reproduce
script
```bash
#!/bin/sh -eux

echo "Goto script's directory"
cd "$(dirname "$0")"
echo "Stop master if exists"
test -f master/postmaster.pid && pg_ctl -w -D master stop -m fast || echo
$?
echo "Remove master directory"
rm -rf master
echo "Create master directory"
mkdir -p master
echo "Initialize master, data checksums are mandatory"
pg_ctl initdb -D master -o "--data-checksums -N -A trust --wal-segsize 1"
echo "Debug to see recycling WAL, keep only 5 WAL files (because segsize
1MB)"
cat >>master/postgresql.conf <<EOF
hot_standby = on
logging_collector = on
log_min_messages = debug
wal_keep_size = 5MB
EOF
echo "Accept replication connections on master"
cat >> master/pg_hba.conf <<EOF
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
EOF
echo "Start master"
pg_ctl -w -D master start -o "-p 5000"
echo "Stop standby1 if exists"
test -f standby1/postmaster.pid && pg_ctl -w -D standby1 stop -m fast ||
echo $?
echo "Remove standby1 directory"
rm -rf standby1
echo "Base backup is taken with xlog files included"
pg_basebackup -D standby1 -p 5000 -X fetch --verbose --write-recovery-conf
echo "Start standby1"
pg_ctl -w -D standby1 start -o "-p 5001"
echo "Stop standby2 if exists"
test -f standby2/postmaster.pid && pg_ctl -w -D standby2 stop -m fast ||
echo $?
echo "Remove standby2 directory"
rm -rf standby2
echo "Base backup is taken with xlog files included"
pg_basebackup -D standby2 -p 5000 -X fetch --verbose --write-recovery-conf
echo "Start standby2"
pg_ctl -w -D standby2 start -o "-p 5002"
echo "Create table and insert lot of rows"
psql -d postgres -a --no-psqlrc -p 5000 <<EOF
create table a (a int, b int);
insert into a select i, i from generate_series(1, 100000) i;
EOF
echo "Stop master to recycle WAL files"
pg_ctl -w -D master stop -m fast
echo "Stop standby1 to recycle WAL files"
pg_ctl -w -D standby1 stop -m fast
echo "Stop standby2 to recycle WAL files"
pg_ctl -w -D standby2 stop -m fast
echo "Start master"
pg_ctl -w -D master start -o "-p 5000"
echo "Start standby1"
pg_ctl -w -D standby1 start -o "-p 5001"
echo "Start standby2"
pg_ctl -w -D standby2 start -o "-p 5002"
echo "Now promote standby1"
pg_ctl -w -D standby1 promote
echo "Wait until standby1 is promoted"
while ! pg_isready -p 5001; do sleep 1; done
echo "Insert some rows"
psql -d postgres -a --no-psqlrc -p 5000 <<EOF
insert into a select i, i from generate_series(1, 340) i;
EOF
echo "Standby2 node need to be stopped as well."
pg_ctl -w -D standby2 stop -m fast
echo "Stop the master after standby promotion"
pg_ctl -w -D master stop -m fast
echo "Standby1 node need to be stopped as well."
pg_ctl -w -D standby1 stop -m fast
echo "Do rewind using standby1 pgdata as source"
pg_rewind --progress --debug --source-pgdata=standby1
--target-pgdata=standby2
echo "Parameters for standby2 postgresql.conf"
cat >> standby2/postgresql.conf <<EOF
primary_conninfo = 'port=5001'
EOF
echo "Place standby2 signal file"
touch standby2/standby.signal
echo "Start standby1"
pg_ctl -w -D standby1 start -o "-p 5001"
echo "Start standby2"
pg_ctl -w -D standby2 start -o "-p 5002"
echo "Same query"
psql -d postgres -a --no-psqlrc -p 5001 <<EOF
select count(*) from a;
EOF
echo "Different results"
psql -d postgres -a --no-psqlrc -p 5002 <<EOF
select count(*) from a;
EOF
echo "Stop standby1"
pg_ctl -w -D standby1 stop -m fast
echo "Stop standby2"
pg_ctl -w -D standby2 stop -m fast
```
output
```bash
+ echo 'Goto script'"'"'s directory'
Goto script's directory
+ dirname /var/lib/postgresql/r.sh
+ cd /var/lib/postgresql
+ echo 'Stop master if exists'
Stop master if exists
+ test -f master/postmaster.pid
+ echo 1
1
+ echo 'Remove master directory'
Remove master directory
+ rm -rf master
+ echo 'Create master directory'
Create master directory
+ mkdir -p master
+ echo 'Initialize master, data checksums are mandatory'
Initialize master, data checksums are mandatory
+ pg_ctl initdb -D master -o '--data-checksums -N -A trust --wal-segsize
1'
The files belonging to this database system will be owned by user
"postgres".
This user must also own the server process.

The database cluster will be initialized with locale "ru_RU.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "russian".

Data page checksums are enabled.

fixing permissions on existing directory master ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Asia/Yekaterinburg
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok

Sync to disk skipped.
The data directory might become corrupt if the operating system crashes.

Success. You can now start the database server using:

/usr/local/bin/pg_ctl -D master -l logfile start

+ echo 'Debug to see recycling WAL, keep only 5 WAL files (because segsize
1MB)'
Debug to see recycling WAL, keep only 5 WAL files (because segsize 1MB)
+ cat
+ echo 'Accept replication connections on master'
Accept replication connections on master
+ cat
+ echo 'Start master'
Start master
+ pg_ctl -w -D master start -o '-p 5000'
waiting for server to start....2024-08-07 17:14:59.431 +05 [19618] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:14:59.431 +05 [19618] DEBUG: mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:14:59.434 +05 [19618] DEBUG: dynamic shared memory system
will support 674 segments
2024-08-07 17:14:59.434 +05 [19618] DEBUG: created dynamic shared memory
control segment 1581037720 (26976 bytes)
2024-08-07 17:14:59.434 +05 [19618] DEBUG: max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:14:59.434 +05 [19618] LOG: redirecting log output to logging
collector process
2024-08-07 17:14:59.434 +05 [19618] HINT: Future log output will appear in
directory "log".
done
server started
+ echo 'Stop standby1 if exists'
Stop standby1 if exists
+ test -f standby1/postmaster.pid
+ echo 1
1
+ echo 'Remove standby1 directory'
Remove standby1 directory
+ rm -rf standby1
+ echo 'Base backup is taken with xlog files included'
Base backup is taken with xlog files included
+ pg_basebackup -D standby1 -p 5000 -X fetch --verbose
--write-recovery-conf
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/800028 on timeline 1
pg_basebackup: write-ahead log end point: 0/800100
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
+ echo 'Start standby1'
Start standby1
+ pg_ctl -w -D standby1 start -o '-p 5001'
waiting for server to start....2024-08-07 17:15:00.282 +05 [19631] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:00.282 +05 [19631] DEBUG: mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:00.286 +05 [19631] DEBUG: dynamic shared memory system
will support 674 segments
2024-08-07 17:15:00.286 +05 [19631] DEBUG: created dynamic shared memory
control segment 1807324640 (26976 bytes)
2024-08-07 17:15:00.286 +05 [19631] DEBUG: max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:00.286 +05 [19631] LOG: redirecting log output to logging
collector process
2024-08-07 17:15:00.286 +05 [19631] HINT: Future log output will appear in
directory "log".
done
server started
+ echo 'Stop standby2 if exists'
Stop standby2 if exists
+ test -f standby2/postmaster.pid
+ echo 1
1
+ echo 'Remove standby2 directory'
Remove standby2 directory
+ rm -rf standby2
+ echo 'Base backup is taken with xlog files included'
Base backup is taken with xlog files included
+ pg_basebackup -D standby2 -p 5000 -X fetch --verbose
--write-recovery-conf
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/900028 on timeline 1
pg_basebackup: write-ahead log end point: 0/900100
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
+ echo 'Start standby2'
Start standby2
+ pg_ctl -w -D standby2 start -o '-p 5002'
waiting for server to start....2024-08-07 17:15:01.123 +05 [19643] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:01.123 +05 [19643] DEBUG: mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:01.126 +05 [19643] DEBUG: dynamic shared memory system
will support 674 segments
2024-08-07 17:15:01.126 +05 [19643] DEBUG: created dynamic shared memory
control segment 2259555868 (26976 bytes)
2024-08-07 17:15:01.126 +05 [19643] DEBUG: max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:01.126 +05 [19643] LOG: redirecting log output to logging
collector process
2024-08-07 17:15:01.126 +05 [19643] HINT: Future log output will appear in
directory "log".
done
server started
+ echo 'Create table and insert lot of rows'
Create table and insert lot of rows
+ psql -d postgres -a --no-psqlrc -p 5000
create table a (a int, b int);
CREATE TABLE
insert into a select i, i from generate_series(1, 100000) i;
INSERT 0 100000
+ echo 'Stop master to recycle WAL files'
Stop master to recycle WAL files
+ pg_ctl -w -D master stop -m fast
waiting for server to shut down....2024-08-07 17:15:01.650 +05 [19619]
DEBUG: logger shutting down
done
server stopped
+ echo 'Stop standby1 to recycle WAL files'
Stop standby1 to recycle WAL files
+ pg_ctl -w -D standby1 stop -m fast
waiting for server to shut down....2024-08-07 17:15:01.731 +05 [19632]
DEBUG: logger shutting down
done
server stopped
+ echo 'Stop standby2 to recycle WAL files'
Stop standby2 to recycle WAL files
+ pg_ctl -w -D standby2 stop -m fast
waiting for server to shut down....2024-08-07 17:15:01.831 +05 [19644]
DEBUG: logger shutting down
done
server stopped
+ echo 'Start master'
Start master
+ pg_ctl -w -D master start -o '-p 5000'
waiting for server to start....2024-08-07 17:15:01.872 +05 [19660] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:01.872 +05 [19660] DEBUG: mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:01.876 +05 [19660] DEBUG: dynamic shared memory system
will support 674 segments
2024-08-07 17:15:01.876 +05 [19660] DEBUG: created dynamic shared memory
control segment 2766949016 (26976 bytes)
2024-08-07 17:15:01.876 +05 [19660] DEBUG: max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:01.876 +05 [19660] LOG: redirecting log output to logging
collector process
2024-08-07 17:15:01.876 +05 [19660] HINT: Future log output will appear in
directory "log".
done
server started
+ echo 'Start standby1'
Start standby1
+ pg_ctl -w -D standby1 start -o '-p 5001'
waiting for server to start....2024-08-07 17:15:01.979 +05 [19670] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:01.979 +05 [19670] DEBUG: mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:01.982 +05 [19670] DEBUG: dynamic shared memory system
will support 674 segments
2024-08-07 17:15:01.982 +05 [19670] DEBUG: created dynamic shared memory
control segment 1083876086 (26976 bytes)
2024-08-07 17:15:01.982 +05 [19670] DEBUG: max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:01.983 +05 [19670] LOG: redirecting log output to logging
collector process
2024-08-07 17:15:01.983 +05 [19670] HINT: Future log output will appear in
directory "log".
done
server started
+ echo 'Start standby2'
Start standby2
+ pg_ctl -w -D standby2 start -o '-p 5002'
waiting for server to start....2024-08-07 17:15:02.085 +05 [19679] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:02.085 +05 [19679] DEBUG: mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:02.088 +05 [19679] DEBUG: dynamic shared memory system
will support 674 segments
2024-08-07 17:15:02.088 +05 [19679] DEBUG: created dynamic shared memory
control segment 4052049392 (26976 bytes)
2024-08-07 17:15:02.088 +05 [19679] DEBUG: max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:02.089 +05 [19679] LOG: redirecting log output to logging
collector process
2024-08-07 17:15:02.089 +05 [19679] HINT: Future log output will appear in
directory "log".
done
server started
+ echo 'Now promote standby1'
Now promote standby1
+ pg_ctl -w -D standby1 promote
waiting for server to promote.... done
server promoted
+ echo 'Wait until standby1 is promoted'
Wait until standby1 is promoted
+ pg_isready -p 5001
/tmp:5001 - accepting connections
+ echo 'Insert some rows'
Insert some rows
+ psql -d postgres -a --no-psqlrc -p 5000
insert into a select i, i from generate_series(1, 340) i;
INSERT 0 340
+ echo 'Standby2 node need to be stopped as well.'
Standby2 node need to be stopped as well.
+ pg_ctl -w -D standby2 stop -m fast
waiting for server to shut down....2024-08-07 17:15:02.411 +05 [19680]
DEBUG: logger shutting down
done
server stopped
+ echo 'Stop the master after standby promotion'
Stop the master after standby promotion
+ pg_ctl -w -D master stop -m fast
waiting for server to shut down....2024-08-07 17:15:02.517 +05 [19661]
DEBUG: logger shutting down
done
server stopped
+ echo 'Standby1 node need to be stopped as well.'
Standby1 node need to be stopped as well.
+ pg_ctl -w -D standby1 stop -m fast
waiting for server to shut down....2024-08-07 17:15:02.617 +05 [19671]
DEBUG: logger shutting down
done
server stopped
+ echo 'Do rewind using standby1 pgdata as source'
Do rewind using standby1 pgdata as source
+ pg_rewind --progress --debug '--source-pgdata=standby1'
'--target-pgdata=standby2'
pg_rewind: Source timeline history:
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: servers diverged at WAL location 0/10392E8 on timeline 1
pg_rewind: no rewind required
+ echo 'Parameters for standby2 postgresql.conf'
Parameters for standby2 postgresql.conf
+ cat
+ echo 'Place standby2 signal file'
Place standby2 signal file
+ touch standby2/standby.signal
+ echo 'Start standby1'
Start standby1
+ pg_ctl -w -D standby1 start -o '-p 5001'
waiting for server to start....2024-08-07 17:15:02.723 +05 [19702] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:02.723 +05 [19702] DEBUG: mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:02.726 +05 [19702] DEBUG: dynamic shared memory system
will support 674 segments
2024-08-07 17:15:02.726 +05 [19702] DEBUG: created dynamic shared memory
control segment 3627107606 (26976 bytes)
2024-08-07 17:15:02.726 +05 [19702] DEBUG: max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:02.726 +05 [19702] LOG: redirecting log output to logging
collector process
2024-08-07 17:15:02.726 +05 [19702] HINT: Future log output will appear in
directory "log".
done
server started
+ echo 'Start standby2'
Start standby2
+ pg_ctl -w -D standby2 start -o '-p 5002'
waiting for server to start....2024-08-07 17:15:02.830 +05 [19712] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:02.830 +05 [19712] DEBUG: mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:02.833 +05 [19712] DEBUG: dynamic shared memory system
will support 674 segments
2024-08-07 17:15:02.833 +05 [19712] DEBUG: created dynamic shared memory
control segment 2291233354 (26976 bytes)
2024-08-07 17:15:02.833 +05 [19712] DEBUG: max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:02.833 +05 [19712] LOG: redirecting log output to logging
collector process
2024-08-07 17:15:02.833 +05 [19712] HINT: Future log output will appear in
directory "log".
done
server started
+ echo 'Same query'
Same query
+ psql -d postgres -a --no-psqlrc -p 5001
select count(*) from a;
count
--------
100000
(1 row)

+ echo 'Different results'
Different results
+ psql -d postgres -a --no-psqlrc -p 5002
select count(*) from a;
count
--------
100340
(1 row)

+ echo 'Stop standby1'
Stop standby1
+ pg_ctl -w -D standby1 stop -m fast
waiting for server to shut down....2024-08-07 17:15:03.077 +05 [19703]
DEBUG: logger shutting down
done
server stopped
+ echo 'Stop standby2'
Stop standby2
+ pg_ctl -w -D standby2 stop -m fast
waiting for server to shut down....2024-08-07 17:15:03.117 +05 [19713]
DEBUG: logger shutting down
done
server stopped
```

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2024-08-07 14:46:06 Re: BUG #18573: Analyze command consumes several GB of memory - more than analyzed table size
Previous Message Yeddula, Madhusudhan reddy [CONTINGENT WORKER] 2024-08-07 11:14:38 RE: BUG #18569: Memory leak in Postgres Enterprise server