From: | Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | [PATCH] Avoid useless prefetches in case of recent FPI WAL records |
Date: | 2025-03-29 22:57:28 |
Message-ID: | 9bdb428a-fc8e-4801-8fff-1ad0bc6fa76b@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
One customer complained that physical standby has significant replay lag
(few hours) on modern powerful bare metal servers. According to output
of "top", startup process and few kernel workers utilized CPU, mostly in
kernel space.
The incident has been resolved by disabling recovery_prefetch after
quick analysis of perf and strace outputs.
Problem description
---------------------
Actual XLog prefetching code doesn't consider that shared buffer may be
loaded from WAL FPI rather than from disk and makes fadvise calls in
such cases.
It results into amount of useless fadvise calls and slowness of WAL
apply process.
The full_page_writes are enabled by default, so after checkpoint any
first modification is logged into WAL with full page. The startup
process will read full page images from WAL stream and load them into
shared buffer by XLogReadBufferForRedoExtended.
It seems there is no point to prefetch shared buffer from disk if FPI
are logged recently.
Test case
----------
To reproduce case of useless prefetches, please create master instance
and physical standby with default shared buffer (128MiB), then insert
test data with size a bit more than shared buffers (32818 pages):
drop table if exists t1;
create table t1 (id int, k int, val text ) with (fillfactor =
10, autovacuum_enabled=off);
insert into t1 (id,k,val)
select i, 0, repeat('x', 800) from generate_series(1,32*1024 +
50) as i;
checkpoint;
Reset stats on standby server:
select pg_stat_reset_shared('io');
select pg_stat_reset_shared('recovery_prefetch');
Update table on master and check stats on standby:
checkpoint;
update t1 set k = k + 1;
select object, context,reads, stats_reset from pg_stat_io where
backend_type = 'startup';
select * from pg_stat_recovery_prefetch ;
Here is output:
postgres=# select object, context,reads, stats_reset from
pg_stat_io where backend_type = 'startup';
object | context | reads | stats_reset
----------+-----------+-------+-------------------------------
relation | bulkread | 0 | 2025-03-29 16:38:45.664648+03
relation | bulkwrite | 0 | 2025-03-29 16:38:45.664648+03
relation | init | 0 | 2025-03-29 16:38:45.664648+03
relation | normal | 9 | 2025-03-29 16:38:45.664648+03
relation | vacuum | 0 | 2025-03-29 16:38:45.664648+03
wal | init | | 2025-03-29 16:38:45.664648+03
wal | normal | 8100 | 2025-03-29 16:38:45.664648+03
(7 rows)
postgres=# select * from pg_stat_recovery_prefetch ;
stats_reset | prefetch | hit | skip_init |
skip_new | skip_fpw | skip_rep | wal_distance | block_distance | io_depth
-------------------------------+----------+-----+-----------+----------+----------+----------+--------------+----------------+----------
2025-03-29 16:38:45.665322+03 | 32383 | 435 | 0 | 0
| 32818 | 32818 | 0 | 0 | 0
(1 row)
The startup process called fadvise 32383 times, skipped 32818 FPW
blocks, but read relations only 9 times.
Idea how to fix it
-------------------
As of now XLogPrefetcher skips FPW blocks without any further tracking.
It seems that book-keeping of this block to avoid further prefetching is
more correct.
Please find attached patch for this idea. It adds block into
recent_block & recent_rlocator.
After patch the stats are following:
postgres=# select object, context,reads, stats_reset from
pg_stat_io where backend_type = 'startup';
object | context | reads | stats_reset
----------+-----------+-------+-------------------------------
relation | bulkread | 0 | 2025-03-30 01:27:45.981169+03
relation | bulkwrite | 0 | 2025-03-30 01:27:45.981169+03
relation | init | 0 | 2025-03-30 01:27:45.981169+03
relation | normal | 9 | 2025-03-30 01:27:45.981169+03
relation | vacuum | 0 | 2025-03-30 01:27:45.981169+03
wal | init | | 2025-03-30 01:27:45.981169+03
wal | normal | 7895 | 2025-03-30 01:27:45.981169+03
(7 rows)
postgres=# select * from pg_stat_recovery_prefetch ;
stats_reset | prefetch | hit | skip_init |
skip_new | skip_fpw | skip_rep | wal_distance | block_distance | io_depth
-------------------------------+----------+-----+-----------+----------+----------+----------+--------------+----------------+----------
2025-03-30 01:27:46.971602+03 | 0 | 0 | 0 | 0
| 32818 | 65636 | 0 | 0 | 0
(1 row)
i.e. there is no more useless fadvise() calls and amount of skipped
repeated blocks are increased by amount of modified blocks.
Please feel free to ask any questions!
Thank you!
--
Michael Zhilin
Postgres Professional
https://www.postgrespro.ru
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Avoid-useless-prefetches-in-case-of-recent-FPI-WA.patch | text/x-patch | 2.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2025-03-30 00:39:54 | Re: AIO v2.5 |
Previous Message | Bruce Momjian | 2025-03-29 22:22:53 | Re: Partial aggregates pushdown |