From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>
Subject: Vacuum scans all-frozen pages with checksums enabled
Date: 2024-12-05 16:07:52
Message-ID: CAAKRu_ZvkxDRtMAhYRTK9N60jfMMVkiQgnEP5m_asYnuvgmQOg@mail.gmail.com
Lists: pgsql-hackers
Hi,
I investigated what I thought was an issue with my patch to amortize
aggressive vacuum [1] and found some interesting behavior also present
on master. It seems vacuum will scan many all-frozen pages due to an
interaction between the opportunistic freeze heuristic and
SKIP_PAGES_THRESHOLD. This is not really an actionable finding (no
clear way to fix it) but I wanted to put a repro and analysis on the
mailing list in case anyone else ever notices this odd behavior.
Vacuum will opportunistically freeze a page if we just emitted an FPI
(or would emit one). This has a very peculiar interaction with several
other subsystems that can lead us to eventually scan huge numbers of
already frozen pages during vacuum.
In an insert-only workload, you would expect to freeze contiguous ranges of pages: transaction IDs mostly increase along with block numbers, so tuples' xmins become old enough to freeze in roughly block order.
However, we don't end up freezing contiguous ranges when checksums are enabled.
In an insert-only workload (like the one I provide at the bottom of
the email), the first time we vacuum a page is often when we set hint
bits. If checksums are enabled and this is the first time we are
dirtying the buffer since the last checkpoint, this will emit an FPI.
So, if bgwriter or checkpointer writes the page out between the last
insert into it and its first vacuum, we will likely emit an FPI when
vacuuming it.
And because we opportunistically freeze pages if we emitted an FPI
setting their hint bits, we end up freezing a number of pages for this
reason.
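To make the chain of conditions concrete, here is a minimal sketch of the interaction just described. This is not the actual PostgreSQL code; the function and parameter names are made up for illustration:

```python
# Hypothetical sketch of the interaction described above, not vacuumlazy.c
# itself. Names are illustrative.

def vacuum_page(page_clean_since_checkpoint, checksums_enabled):
    """Return (emits_fpi, freezes): whether setting hint bits during
    vacuum WAL-logs a full-page image, and whether that in turn triggers
    opportunistic freezing of the page."""
    # With checksums (or wal_log_hints), dirtying a buffer for the first
    # time since the last checkpoint emits an FPI, even for hint bits.
    emits_fpi = checksums_enabled and page_clean_since_checkpoint
    # Master's heuristic: if we emitted (or would emit) an FPI anyway,
    # freeze the page opportunistically.
    freezes = emits_fpi
    return emits_fpi, freezes

# A page bgwriter/checkpointer wrote out before vacuum reached it:
print(vacuum_page(page_clean_since_checkpoint=True, checksums_enabled=True))
# A page still dirty from the inserts: no FPI, so no opportunistic freeze:
print(vacuum_page(page_clean_since_checkpoint=False, checksums_enabled=True))
```

The point is that whether a page gets frozen depends on whether it happened to be written out between the last insert and the vacuum, which is essentially arbitrary from the table's perspective.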
What is interesting is that the ranges of contiguous frozen and
unfrozen pages are pretty small.
For example:
start_block | end_block | nblocks | all_visible | all_frozen
-------------+-----------+---------+-------------+------------
1396695 | 1396695 | 1 | t | f
1396696 | 1396710 | 14 | t | t
1396711 | 1396711 | 1 | t | f
1396712 | 1396730 | 18 | t | t
1396731 | 1396731 | 1 | t | f
1396732 | 1396758 | 26 | t | t
1396759 | 1396759 | 1 | t | f
1396760 | 1396767 | 7 | t | t
1396768 | 1396768 | 1 | t | f
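For reference, the kind of range summary shown above can be computed by collapsing a per-block sequence of visibility-map flags into runs. A small sketch (not from the attached patch; names are illustrative):

```python
# Collapse per-block (all_visible, all_frozen) flags into contiguous
# ranges like the table above. Illustrative only.
from itertools import groupby

def vm_ranges(flags, start_block=0):
    """flags: iterable of (all_visible, all_frozen) booleans, one per block.
    Returns (start_block, end_block, nblocks, all_visible, all_frozen)."""
    out, blkno = [], start_block
    for (av, af), grp in groupby(flags):
        n = len(list(grp))
        out.append((blkno, blkno + n - 1, n, av, af))
        blkno += n
    return out

# One unfrozen page interrupting two all-frozen runs:
flags = [(True, True)] * 3 + [(True, False)] + [(True, True)] * 2
for row in vm_ranges(flags, start_block=100):
    print(row)
```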
This fragmentation usually starts with the first vacuum after we have
dirtied enough buffers to provide work for bgwriter and at least one
checkpoint has happened. The most likely explanation is that because
bgwriter writes out buffers in the buffer clock order (not the order
the blocks are in the table), there are small random ranges of dirty
and clean buffers in the table. Combine that with checkpointer writing
out other buffers and the effect can be even more dramatic.
Here is some log output for the first vacuum after a checkpoint had
started (I patched master to print the number of hint bit fpis in
autovacuum log output):
automatic vacuum of table "melanieplageman.public.history":
pages: 0 removed, 605229 remain, 363538 scanned (60.07% of total)
frozen: 190990 pages from table (31.56% of total) had 10503305 tuples frozen
WAL usage: 745501 records, 205124 full page images, 190999 hint bit fpis, 205494134 bytes
During an aggressive vacuum, all all-visible pages will be scanned
(because we must scan every unfrozen tuple to advance relfrozenxid).
All-frozen pages can be skipped. However, if a range of skippable
blocks is shorter than SKIP_PAGES_THRESHOLD (hard-coded to 32), we
scan those skippable blocks anyway -- on the assumption that issuing
many small, discontiguous read requests to the kernel would defeat
readahead.
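The skipping rule can be sketched as follows (simplified from vacuumlazy.c's logic; the function name is made up):

```python
SKIP_PAGES_THRESHOLD = 32  # hard-coded in vacuumlazy.c

def scans_skippable_run(run_length):
    """A run of skippable (here, all-frozen) blocks is only actually
    skipped when it is at least SKIP_PAGES_THRESHOLD blocks long;
    shorter runs are read anyway to preserve OS readahead."""
    return run_length < SKIP_PAGES_THRESHOLD

# The 14-, 18-, and 26-block frozen runs from the example above are all
# scanned despite being all-frozen; only runs of 32+ blocks are skipped:
print([scans_skippable_run(n) for n in (14, 18, 26, 40)])
# -> [True, True, True, False]
```

With the fragmentation shown earlier, nearly every frozen run falls under the threshold, so the aggressive vacuum reads them all.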
The result of all of this is that aggressive vacuums may read large
numbers of frozen pages when checksums are enabled. Here is the
autovacuum logging output of one example (I've patched master to print
the pages scanned due to SKIP_PAGES_THRESHOLD and how many of those
are frozen):
automatic aggressive vacuum to prevent wraparound of table
"melanieplageman.public.history":
pages: 0 removed, 2454172 remain, 1437630 scanned (58.58% of total)
frozen skippable blocks scanned: 162335
As I said, I don't think we can do anything about this unless we find
a good alternative freeze heuristic (I've been trying and failing at
that for two years).
I attached a patch to log the frozen skippable blocks scanned by
vacuum and a repro.
- Melanie
--------
Repro:
The repro below includes a lot of GUCs that are required to get an
aggressive vacuum in a reasonable amount of time and see the effect.
It doesn't need to run for the full number of transactions to see the
effect -- just until the first aggressive vacuum of the history table.
psql -c "ALTER SYSTEM SET shared_buffers = '1GB';" \
-c "ALTER SYSTEM SET log_checkpoints = on;" \
-c "ALTER SYSTEM SET max_wal_size = '150 GB';" \
-c "ALTER SYSTEM SET min_wal_size = '150 GB';" \
-c "ALTER SYSTEM SET autovacuum_naptime = 10;" \
-c "ALTER SYSTEM SET log_autovacuum_min_duration = 0;" \
-c "ALTER SYSTEM SET synchronous_commit = off;" \
-c "ALTER SYSTEM SET checkpoint_timeout = '2min';" \
-c "ALTER SYSTEM SET vacuum_cost_limit = 2000;" \
-c "ALTER SYSTEM SET wal_compression = 'zstd';" \
-c "ALTER SYSTEM SET wal_buffers = '2MB';" \
-c "ALTER SYSTEM SET vacuum_freeze_min_age=10000000;" \
-c "ALTER SYSTEM SET autovacuum_freeze_max_age=10000000;" \
-c "ALTER SYSTEM SET vacuum_freeze_table_age=8000000;" \
-c "ALTER SYSTEM SET autovacuum_vacuum_insert_scale_factor='0.01';" \
-c "ALTER SYSTEM SET maintenance_work_mem = '1GB';" \
-c "ALTER SYSTEM SET autovacuum_vacuum_insert_threshold=10000000;"
pg_ctl restart
psql -c "CREATE TABLE history(
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    client_id INT NOT NULL,
    mtime TIMESTAMPTZ DEFAULT NOW(),
    data TEXT);"
pgbench \
--random-seed=0 \
--no-vacuum \
-M prepared \
-c 8 \
-j 8 \
-t 11100000 \
-f- <<EOF
INSERT INTO history(client_id, data)
VALUES
(:client_id, repeat('a', 90)),
(:client_id, repeat('b', 90)),
(:client_id, repeat('c', 90)),
(:client_id, repeat('d', 90)),
(:client_id, repeat('e', 90)),
(:client_id, repeat('f', 90)),
(:client_id, repeat('g', 90)),
(:client_id, repeat('h', 90)),
(:client_id, repeat('i', 90)),
(:client_id, repeat('j', 90));
EOF
Attachment: Track-skippable-blocks-scanned.patch (text/x-patch, 2.8 KB)