From: | 王海洋 <wanghaiyang(dot)001(at)bytedance(dot)com> |
---|---|
To: | PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | [PATCH] BUG FIX: inconsistent page found in BRIN_REGULAR_PAGE |
Date: | 2022-08-03 06:37:30 |
Message-ID: | CACciXADOfErX9Bx0nzE_SkdfXr6Bbpo5R=v_B6MUTEYW4ya+cg@mail.gmail.com |
Lists: | pgsql-bugs |
Hi hackers,
I found that when wal_consistency_checking = brin is set, redo may abort:
all standby nodes are lost, and the primary node cannot be restarted.
This bug exists in all versions of PostgreSQL.
Steps to reproduce:
1. Create a primary instance, set wal_consistency_checking = brin, and
start the primary instance.
initdb -D pg_test
echo "wal_consistency_checking = brin" >> pg_test/postgresql.conf
echo "port=53320" >> pg_test/postgresql.conf
pg_ctl start -D pg_test -l pg_test.logfile
2. Create a standby instance.
pg_basebackup -R -p 53320 -D pg_test_slave
echo "wal_consistency_checking = brin" >> pg_test_slave/postgresql.conf
echo "port=53321" >> pg_test_slave/postgresql.conf
pg_ctl start -D pg_test_slave -l pg_test_slave.logfile
3. Execute brin_redo_abort.sql through psql against the primary, and
observe that the standby instance is lost.
psql -p 53320 -f brin_redo_abort.sql
4. The standby instance is lost during redo, with the following FATAL message:
FATAL: inconsistent page found, rel 1663/12978/16387, forknum 0, blkno 2
5. The primary instance also cannot be restarted with pg_ctl restart -mi,
because the crash recovery that follows an immediate shutdown replays the
same WAL records.
pg_ctl restart -D pg_test -mi -l pg_test.logfile
6. Restarting the primary instance fails with the same FATAL message:
FATAL: inconsistent page found, rel 1663/12978/16387, forknum 0, blkno 2
My analysis of the cause is as follows:
1. When the revmap needs to be extended by brinRevmapExtend,
brin_start_evacuating_page may set the BRIN_EVACUATE_PAGE flag on a
regular BRIN page to prevent other concurrent backends from adding
more BrinTuples to that page.
2. During redo, however, brin_xlog_update does not set the
BRIN_EVACUATE_PAGE flag on that regular page after removing the old
BrinTuple, and it has no need to, since no one will add a BrinTuple to
that page at that time.
3. As a result, the full-page image stored in the WAL record and the page
produced by redo differ in the BRIN_EVACUATE_PAGE flag, so
checkXLogConsistency throws a FATAL error after redo, and redo aborts.
4. Therefore, the BRIN_EVACUATE_PAGE flag should be cleared before
checkXLogConsistency compares the two pages.
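The mismatch described in the four points above can be illustrated with a toy model. This is only a sketch, not PostgreSQL code and not the attached patch: ToyPage, TOY_EVACUATE_PAGE, toy_mask, toy_pages_equal and toy_consistency_check are hypothetical names, and the flag value is an assumption chosen for the example. It shows why a flag that is set on the primary but not during redo makes a page comparison fail, and why masking that flag on both sides before comparing lets the check pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bit standing in for BRIN_EVACUATE_PAGE (value assumed). */
#define TOY_EVACUATE_PAGE 0x01u

/* Hypothetical, heavily simplified page: a flags word plus some payload. */
typedef struct ToyPage
{
    uint16_t flags;
    uint8_t  data[8];
} ToyPage;

/* Return a copy of the page with the evacuate flag cleared. */
static ToyPage
toy_mask(ToyPage page)
{
    page.flags &= (uint16_t) ~TOY_EVACUATE_PAGE;
    return page;
}

/* Field-by-field comparison of two pages. */
static bool
toy_pages_equal(const ToyPage *a, const ToyPage *b)
{
    return a->flags == b->flags &&
           memcmp(a->data, b->data, sizeof(a->data)) == 0;
}

/*
 * Compare the page image taken on the primary with the page produced by
 * redo.  When mask_evacuate is true, the evacuate flag is cleared on both
 * copies first, so a difference in that flag alone is not reported.
 */
static bool
toy_consistency_check(const ToyPage *primary_image,
                      const ToyPage *replayed_page,
                      bool mask_evacuate)
{
    assert(primary_image != NULL && replayed_page != NULL);

    ToyPage a = *primary_image;
    ToyPage b = *replayed_page;

    if (mask_evacuate)
    {
        a = toy_mask(a);
        b = toy_mask(b);
    }
    return toy_pages_equal(&a, &b);
}
```

With a primary image that has TOY_EVACUATE_PAGE set and a replayed page that does not, toy_consistency_check reports a mismatch when mask_evacuate is false and a match when it is true. A real difference in the payload is still detected in both cases.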
The patch, the SQL file, the shell script, and the log files are attached.
Best Regards!
Haiyang Wang
Attachment | Content-Type | Size |
---|---|---|
brin_redo_abort.sh | application/octet-stream | 716 bytes |
0001-clear-BRIN_EVACUATE_PAGE-before-consistency-checking.patch | application/octet-stream | 2.3 KB |
pg_test.logfile | application/octet-stream | 2.4 KB |
brin_redo_abort.sql | application/octet-stream | 660 bytes |
pg_test_slave.logfile | application/octet-stream | 1.7 KB |