From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH] Btree BackwardScan race condition on Standby during VACUUM
Date: 2020-03-16 14:07:53
Message-ID: CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
Lists: pgsql-hackers
Hello, hackers.
------ ABSTRACT ------
There is a race condition between btree_xlog_unlink_page and _bt_walk_left.
Many versions are affected, including 12 and the upcoming 13.
It happens only on standbys. It does not appear to cause invalid query results.
------ REMARK ------
While working on support for index hint bits on standby [1] I started
getting
"ERROR: could not find left sibling of block XXXX in index XXXX"
during stress tests.
I was sure I had broken something in btree and spent a lot of time
trying to figure out what.
And later... I realized it is a bug that has been in btree since very old times...
Because of the much faster scans with LP_DEAD support on a standby, it
happens much more frequently in my case.
------ HOW TO REPRODUCE ------
It is not easy to reproduce the issue, but you can try (tested on
REL_12_STABLE and master):
1) Set up the master (a synchronous replica and 'remote_apply' are not
required - they just make the scripts simpler):
autovacuum = off
synchronous_standby_names = '*'
synchronous_commit = 'remote_apply'
2) Set up the standby:
primary_conninfo = 'user=postgres host=127.0.0.1 port=5432
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres
target_session_attrs=any'
port = 6543
3) Prepare a pgbench script file (test.bench) with this content:
BEGIN;
select * from pgbench_accounts order by aid desc limit 1;
END;
4) Prepare the index:
./pgbench -i -s 10 -U postgres
./psql -U postgres -c "delete from pgbench_accounts where aid IN
(select aid from pgbench_accounts order by aid desc limit 500000)"
5) Start index scans on the standby:
./pgbench -f test.bench -j 1 -c ${NUMBER_OF_CORES} -n -P 1 -T 10000 -U
postgres -p 6543
6) Run vacuum on the master:
./psql -U postgres -c "vacuum pgbench_accounts"
7) You should see something like this:
> progress: 1.0 s, 5.0 tps, lat 614.530 ms stddev 95.902
> .....
> progress: 5.0 s, 10.0 tps, lat 508.561 ms stddev 82.338
> client 3 script 0 aborted in command 1 query 0: ERROR: could not find left sibling of block 1451 in index "pgbench_accounts_pkey"
> progress: 6.0 s, 47.0 tps, lat 113.001 ms stddev 55.709
> .....
> progress: 12.0 s, 84.0 tps, lat 48.451 ms stddev 7.238
> client 2 script 0 aborted in command 1 query 0: ERROR: could not find left sibling of block 2104 in index "pgbench_accounts_pkey"
> progress: 13.0 s, 87.0 tps, lat 39.338 ms stddev 5.417
> .....
> progress: 16.0 s, 158.0 tps, lat 18.988 ms stddev 3.251
> client 4 script 0 aborted in command 1 query 0: ERROR: could not find left sibling of block 2501 in index "pgbench_accounts_pkey"
I was able to reproduce the issue with vanilla PG 12 on different systems,
including a Windows machine.
On some servers it happens 100% of the time; on others, very seldom.
It is possible to radically increase the chance of reproducing the issue by
adding a sleep in btree_xlog_unlink_page [7]:
> + pg_usleep(10 * 1000L);
>
> /* Rewrite target page as empty deleted page */
> buffer = XLogInitBufferForRedo(record, 0);
------ WHAT HAPPENS ------
It is a race condition between btree_xlog_unlink_page and _bt_walk_left.
btree_xlog_unlink_page removes a page from the btree, changing btpo_prev and
btpo_next of its left and right siblings to point
to each other, and marks the target page as deleted. All these
operations are done using one-page-at-a-time locking because of [4]:
> * In normal operation, we would lock all the pages this WAL record
> * touches before changing any of them. In WAL replay, it should be okay
> * to lock just one page at a time, since no concurrent index updates can
> * be happening, and readers should not care whether they arrive at the
> * target page or not (since it's surely empty).
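To make the ordering concrete, here is the first half of a small toy C model
(purely illustrative, not PostgreSQL code: a simplified Page struct stands in
for a btree page, and 'deleted' for BTP_DELETED). The three sketches in this
mail together form one small program. On the standby, each step below is done
under a separate per-page lock:

#include <stdbool.h>
#include <stdio.h>

#define P_NONE (-1)                 /* "no sibling" */

typedef struct Page
{
    int  prev;                      /* stand-in for btpo_prev */
    int  next;                      /* stand-in for btpo_next */
    bool deleted;                   /* stand-in for BTP_DELETED */
} Page;

/*
 * Toy version of the unlink performed by btree_xlog_unlink_page.  On the
 * standby every step is done under a separate per-page lock, so a
 * concurrent backward scan may observe the state between any two steps.
 */
static void
unlink_page(Page *pages, int target)
{
    int left  = pages[target].prev;
    int right = pages[target].next;

    pages[right].prev = left;       /* step 1: fix the right sibling */
    pages[left].next  = right;      /* step 2: fix the left sibling  */
    /* a reader arriving here sees the target unlinked but NOT deleted */
    pages[target].deleted = true;   /* step 3: mark the target as deleted */
}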
_bt_walk_left walks left in a very tricky way. Please refer to
src/backend/access/nbtree/README for details [5]:
> Moving left in a backward scan is complicated because we must consider
> the possibility that the left sibling was just split (meaning we must find
> the rightmost page derived from the left sibling), plus the possibility
> that the page we were just on has now been deleted and hence isn't in the
> sibling chain at all anymore.
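Roughly, the recovery logic can be pictured like this (again a toy paraphrase,
not the real _bt_walk_left, continuing the Page model from the sketch above;
the -2 return corresponds to the "could not find left sibling" ERROR):

/*
 * Toy paraphrase of the left-walk recovery logic.  Returns the block to the
 * left of 'cur', P_NONE if there is none, or -2 on the failure that
 * corresponds to "could not find left sibling of block N".
 */
static int
walk_left(Page *pages, int cur)
{
    int blkno = pages[cur].prev;

    for (;;)
    {
        if (blkno == P_NONE)
            return P_NONE;          /* 'cur' is the leftmost page */

        if (pages[blkno].next == cur)
            return blkno;           /* normal case: found our left sibling */

        /*
         * The candidate does not point back to us: either it was split or
         * our page was removed from the chain.  Move right, looking for a
         * page whose right-link points back to 'cur'.
         */
        blkno = pages[blkno].next;

        if (blkno == P_NONE)
        {
            /*
             * Fell off the end without finding a page pointing back to us.
             * If 'cur' is marked deleted the real code can recover, but if
             * it is NOT deleted it has to give up with the ERROR seen in
             * the pgbench output above.
             */
            if (pages[cur].deleted)
                return P_NONE;
            return -2;
        }
    }
}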
So, this is how the race happens:
0) This is the target page (B) and its siblings:
A <---> B <---> C ---> END
1) The walreceiver starts btree_xlog_unlink_page for B. It changes
the links from C to A and from A to C (I hope the diagram displays
correctly):
A <---- B ----> C ---> END
^               ^
\_______________/
But B is not marked as BTP_DELETED yet - the walreceiver stops at nbtxlog.c:697 [2].
2) Another backend starts _bt_walk_left from B.
It checks A, goes from A to C via the updated btpo_next, and later sees
the end of the btree.
So, the next step is to check whether B was deleted (nbtsearch.c:2011) [3] and
try to recover.
But B is not deleted yet! It will be marked as BTP_DELETED by the walreceiver
a few milliseconds later, but not yet.
So, the error at nbtsearch.c:2046 [6] happens.
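Putting the two toy sketches above together in a main(), the same sequence
reproduces the error: the "walreceiver" has applied steps 1 and 2 of
unlink_page() but not step 3 when the "backward scan" runs walk_left():

int
main(void)
{
    /* chain: A(0) <-> B(1) <-> C(2), C is the rightmost page */
    Page pages[3] = {
        {P_NONE, 1, false},         /* A */
        {0, 2, false},              /* B */
        {1, P_NONE, false},         /* C */
    };

    /* walreceiver: steps 1 and 2 of unlink_page(B), interrupted before 3 */
    pages[2].prev = 0;              /* C.prev -> A */
    pages[0].next = 2;              /* A.next -> C */
    /* B is no longer in the sibling chain but is NOT marked deleted */

    /* backward scan: try to step left from B */
    if (walk_left(pages, 1) == -2)
        printf("could not find left sibling of block 1\n");
    return 0;
}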
------ HOW TO FIX ------
The first idea was to mark the page as BTP_DELETED before updating the sibling links.
The second was to update the pages in the following order:
* change btpo_next
* mark as BTP_DELETED
* change btpo_prev
Such changes fix exactly the race condition described above... but
cause trickier ones to start happening.
And I think it is better to avoid overly complex, unclear solutions here.
Another idea was to sleep a little, waiting for the walreceiver to mark the
page as deleted. But that feels too ugly, and it is unclear how long
to wait.
So, I think the right way is to lock all three pages, as is done on the
primary. As far as I can see, this does not cause any real performance
regression.
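In terms of the toy model above, the idea is simply to hold locks on all three
pages for the whole operation, so a reader taking the same per-page lock before
following prev/next can never see the half-updated chain (the pthread mutexes
below are only an illustration of the idea; the attached patch of course does
this with the usual buffer locks inside btree_xlog_unlink_page):

#include <pthread.h>

/* one lock per toy page, standing in for a buffer content lock */
static pthread_mutex_t page_locks[3] = {
    PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER
};

static void
unlink_page_locked(Page *pages, int target)
{
    int left  = pages[target].prev;
    int right = pages[target].next;

    /* lock all three pages up front, as the primary does */
    pthread_mutex_lock(&page_locks[left]);
    pthread_mutex_lock(&page_locks[target]);
    pthread_mutex_lock(&page_locks[right]);

    pages[right].prev     = left;
    pages[left].next      = right;
    pages[target].deleted = true;

    /* only now can a backward scan see any of the changes */
    pthread_mutex_unlock(&page_locks[right]);
    pthread_mutex_unlock(&page_locks[target]);
    pthread_mutex_unlock(&page_locks[left]);
}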
The patch is attached (on top of REL_12_STABLE).
Thanks,
Michail.
[1]: https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[2]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L697
[3]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2011
[4]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L575-L581
[5]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/README#L289-L314
[6]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2046
[7]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L696
Attachment: btree-race.patch (text/x-patch, 4.0 KB)