[PATCH] BUG FIX: inconsistent page found in BRIN_REGULAR_PAGE

From: 王海洋 <wanghaiyang(dot)001(at)bytedance(dot)com>
To: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: [PATCH] BUG FIX: inconsistent page found in BRIN_REGULAR_PAGE
Date: 2022-08-03 06:37:30
Message-ID: CACciXADOfErX9Bx0nzE_SkdfXr6Bbpo5R=v_B6MUTEYW4ya+cg@mail.gmail.com
Lists: pgsql-bugs

Hi hackers,

I found that when wal_consistency_checking = brin is set, WAL redo may
abort: all standby nodes are lost, and the primary node cannot be
restarted.

This bug appears to exist in every PostgreSQL version that supports
wal_consistency_checking.

Steps to reproduce:

1. Create a primary instance, set wal_consistency_checking = brin, and
start the primary instance.

initdb -D pg_test
echo "wal_consistency_checking = brin" >> pg_test/postgresql.conf
echo "port=53320" >> pg_test/postgresql.conf
pg_ctl start -D pg_test -l pg_test.logfile

2. Create a standby instance.

pg_basebackup -R -p 53320 -D pg_test_slave
echo "wal_consistency_checking = brin" >> pg_test_slave/postgresql.conf
echo "port=53321" >> pg_test_slave/postgresql.conf
pg_ctl start -D pg_test_slave -l pg_test_slave.logfile

3. Run brin_redo_abort.sql on the primary through psql; the standby
instance then goes down.

psql -p 53320 -f brin_redo_abort.sql

4. The standby instance fails during redo with the following FATAL
message:

FATAL: inconsistent page found, rel 1663/12978/16387, forknum 0, blkno 2

5. The primary instance cannot be restarted through pg_ctl restart -mi.

pg_ctl restart -D pg_test -mi -l pg_test.logfile

6. Restarting the primary instance fails with the same FATAL message:

FATAL: inconsistent page found, rel 1663/12978/16387, forknum 0, blkno 2

My analysis of the cause is as follows:

1. When the revmap needs to be extended by brinRevmapExtend,
brin_start_evacuating_page may set the BRIN_EVACUATE_PAGE flag on a
regular page to prevent concurrent backends from adding more BrinTuples
to that page.

2. During redo, however, there is no need to set the BRIN_EVACUATE_PAGE
flag on that regular page after brin_xlog_update removes the old
BrinTuple, since nothing will add a BrinTuple to that page at this
time. The replayed page therefore does not carry the flag.

3. As a result, CheckXLogConsistency raises a FATAL error after redo,
because the BRIN_EVACUATE_PAGE flag differs between the pages being
compared, and redo aborts.

4. Therefore, the BRIN_EVACUATE_PAGE flag should be cleared before
CheckXLogConsistency.
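
To illustrate point 4, here is a minimal, self-contained sketch in C of
why clearing the flag before the comparison makes the check pass. It is
a toy model and not PostgreSQL code: ToyPage, TOY_EVACUATE_FLAG,
toy_mask and toy_pages_consistent are hypothetical names introduced only
for this example, and the attached patch may clear the flag in a
different place or in a different way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for illustration; NOT PostgreSQL's real definitions. */
#define TOY_PAGE_SIZE     64
#define TOY_EVACUATE_FLAG (1 << 0)   /* models BRIN_EVACUATE_PAGE */

typedef struct ToyPage
{
    uint16_t flags;                  /* models the BRIN page flags */
    uint8_t  data[TOY_PAGE_SIZE];    /* models the tuple area */
} ToyPage;

/* Clear state that redo never reproduces, so it cannot cause a mismatch. */
static void
toy_mask(ToyPage *page)
{
    assert(page != NULL);
    page->flags &= (uint16_t) ~TOY_EVACUATE_FLAG;
}

/*
 * Models the consistency check: compare the page image taken on the
 * primary with the page produced by redo. When mask_evacuate is true,
 * both copies are masked first; the callers' pages are left untouched.
 */
static bool
toy_pages_consistent(const ToyPage *primary_image,
                     const ToyPage *replayed,
                     bool mask_evacuate)
{
    ToyPage a = *primary_image;
    ToyPage b = *replayed;

    if (mask_evacuate)
    {
        toy_mask(&a);
        toy_mask(&b);
    }
    return a.flags == b.flags &&
           memcmp(a.data, b.data, TOY_PAGE_SIZE) == 0;
}
```

In this model, a primary image with TOY_EVACUATE_FLAG set and a replayed
page without it compare as inconsistent when mask_evacuate is false and
as consistent when it is true. A genuine difference in the tuple area is
still reported either way.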

The patch, the SQL file, the shell script, and the log files are
attached.

Best Regards!
Haiyang Wang

Attachment                                                       Content-Type              Size
brin_redo_abort.sh                                               application/octet-stream  716 bytes
0001-clear-BRIN_EVACUATE_PAGE-before-consistency-checking.patch  application/octet-stream  2.3 KB
pg_test.logfile                                                  application/octet-stream  2.4 KB
brin_redo_abort.sql                                              application/octet-stream  660 bytes
pg_test_slave.logfile                                            application/octet-stream  1.7 KB
