WAL record CRC calculated incorrectly because of underlying buffer modification

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: WAL record CRC calculated incorrectly because of underlying buffer modification
Date: 2024-05-10 13:00:01
Message-ID: 055bb33c-bccc-bc1d-c2f8-8598534448ac@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

I've investigated a recent buildfarm failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-05-02%2006%3A40%3A36

where the test failed due to a CRC error:
2024-05-02 17:08:18.401 ACST [3406:7] LOG:  incorrect resource manager data checksum in record at 0/F14D7A60

(Chipmunk produced similar errors as well:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-08-25%2019%3A40%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2024-03-22%2003%3A20%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-08-19%2006%3A43%3A20
)

and discovered that XLogRecordAssemble() calculates CRC over a buffer,
that might be modified by another process.

With the attached patch applied, the following test run:
echo "
autovacuum_naptime = 1
autovacuum_vacuum_threshold = 1

wal_consistency_checking = all
" >/tmp/temp.config

for ((i=1;i<=100;i++)); do echo "iteration $i"; TEMP_CONFIG=/tmp/temp.config TESTS="test_setup hash_index" make
check-tests -s || break; done

fails for me on iterations 7, 10, 17:
ok 1         - test_setup                               2557 ms
not ok 2     - hash_index                              24719 ms
# (test process exited with exit code 2)

postmaster.log contains:
2024-05-10 12:46:44.320 UTC checkpointer[1881151] LOG:  checkpoint starting: immediate force wait
2024-05-10 12:46:44.365 UTC checkpointer[1881151] LOG:  checkpoint complete: wrote 41 buffers (0.3%); 0 WAL file(s)
added, 0 removed, 26 recycled; write=0.001 s, sync=0.001 s, total=0.046 s; sync files=0, longest=0.000 s, average=0.000
s; distance=439134 kB, estimate=527137 kB; lsn=0/3CE131F0, redo lsn=0/3CE13198
TRAP: failed Assert("memcmp(block1_ptr, block1_copy, block1_len) == 0"), File: "xloginsert.c", Line: 949, PID: 1881271
ExceptionalCondition at assert.c:52:13
XLogRecordAssemble at xloginsert.c:953:1
XLogInsert at xloginsert.c:520:9
hashbucketcleanup at hash.c:844:14
hashbulkdelete at hash.c:558:3
index_bulk_delete at indexam.c:760:1
vac_bulkdel_one_index at vacuum.c:2498:10
lazy_vacuum_one_index at vacuumlazy.c:2443:10
lazy_vacuum_all_indexes at vacuumlazy.c:2026:26
lazy_vacuum at vacuumlazy.c:1944:10
lazy_scan_heap at vacuumlazy.c:1050:3
heap_vacuum_rel at vacuumlazy.c:503:2
vacuum_rel at vacuum.c:2214:2
vacuum at vacuum.c:622:8
autovacuum_do_vac_analyze at autovacuum.c:3102:2
do_autovacuum at autovacuum.c:2425:23
AutoVacWorkerMain at autovacuum.c:1569:3
pgarch_die at pgarch.c:846:1
StartChildProcess at postmaster.c:3929:5
StartAutovacuumWorker at postmaster.c:3997:12
process_pm_pmsignal at postmaster.c:3809:3
ServerLoop at postmaster.c:1667:5
BackgroundWorkerInitializeConnection at postmaster.c:4156:1
main at main.c:184:3
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc71a8d7e40]
postgres: autovacuum worker regression(_start+0x25)[0x556a8631a5e5]
2024-05-10 12:46:45.038 UTC checkpointer[1881151] LOG:  checkpoint starting: immediate force wait
2024-05-10 12:46:45.965 UTC autovacuum worker[1881275] LOG: automatic analyze of table "regression.pg_catalog.pg_attribute"
        avg read rate: 0.000 MB/s, avg write rate: 5.409 MB/s
        buffer usage: 1094 hits, 0 misses, 27 dirtied
        system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.03 s
2024-05-10 12:46:46.892 UTC postmaster[1881150] LOG:  server process (PID 1881271) was terminated by signal 6: Aborted
2024-05-10 12:46:46.892 UTC postmaster[1881150] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE
public.hash_cleanup_heap

(This can be reproduced with 027_stream_regress, of course, but it would
take more time.)

Best regards,
Alexander

Attachment Content-Type Size
xloginsert-memchk.patch text/x-patch 1.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nazir Bilal Yavuz 2024-05-10 13:21:20 Re: Fix parallel vacuum buffer usage reporting
Previous Message Melanie Plageman 2024-05-10 12:48:07 Re: open items