From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Subject: | stats.sql might fail due to shared buffers also used by parallel tests |
Date: | 2025-01-06 07:00:00 |
Message-ID: | fe0391a8-dfa9-41c3-bf1c-7ea058e40f30@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello hackers,
I'd like to share my investigation of one mysterious stats.sql failure
occurred in December: [1].
The difference of the failure is:
--- C:/prog/bf/root/HEAD/pgsql/src/test/regress/expected/stats.out 2024-09-18 19:31:14.665516500 +0000
+++ C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2024-12-17
09:57:08.944588500 +0000
@@ -1291,7 +1291,7 @@
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
?column?
----------
- t
+ f
(1 row)
027_stream_regress_primary.log contains:
2024-12-17 09:57:06.778 UTC [8568:109] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_io
WHERE object = 'relation'
2024-12-17 09:57:06.779 UTC [8568:110] pg_regress/stats LOG: statement: CREATE TABLE test_io_shared(a int);
2024-12-17 09:57:06.780 UTC [8568:111] pg_regress/stats LOG: statement: INSERT INTO test_io_shared SELECT i FROM
generate_series(1,100)i;
...
2024-12-17 09:57:06.782 UTC [8568:115] pg_regress/stats LOG: statement: CHECKPOINT;
2024-12-17 09:57:06.794 UTC [3664:40] LOG: checkpoint starting: immediate force wait
2024-12-17 09:57:06.856 UTC [3664:41] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 1 SLRU buffers; 0 WAL
file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.062 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=1875 kB, estimate=52682 kB; lsn=0/14A2F410, redo lsn=0/14A2F3B8
2024-12-17 09:57:06.857 UTC [8568:116] pg_regress/stats LOG: statement: CHECKPOINT;
2024-12-17 09:57:06.857 UTC [3664:42] LOG: checkpoint starting: immediate force wait
2024-12-17 09:57:06.859 UTC [3664:43] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL
file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=0 kB, estimate=47414 kB; lsn=0/14A2F4E0, redo lsn=0/14A2F488
2024-12-17 09:57:06.859 UTC [8568:117] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_io
WHERE object = 'relation'
2024-12-17 09:57:06.860 UTC [8568:118] pg_regress/stats LOG: statement: SELECT 77693 > 77693;
The corresponding test case (added by 10a082bf7, dated 2023-02-11) is:
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_io
WHERE object = 'relation' \gset io_sum_shared_before_
...
CREATE TABLE test_io_shared(a int);
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
...
CHECKPOINT;
CHECKPOINT;
...
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_get_backend_io(pg_backend_pid())
WHERE object = 'relation' \gset my_io_sum_shared_after_
SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
So, as we can see from the log above, both checkpoints wrote 0 buffers,
moreover, no other process/backend wrote any buffers.
After series of experiments, I found that this situation is possible when
other backend steals shared buffers from checkpointer, without making new
buffers dirty, and then delays flushing it's statistics.
Please find attached a reproducer for this issue: I've added two sleeps
to stats.sql and reduced parallel_schedule, but I believe it's a legal
change, which doesn't affect the test case essentially.
With bufchurn.sql running concurrently. and TEMP_CONFIG containing
shared_buffers = 1MB (as in the 027_stream_regress test in question), I can
see exactly the same failure:
$ echo 'shared_buffers = 1MB' >/tmp/extra.config; TEMP_CONFIG=/tmp/extra.config make -s check
# +++ regress check in src/test/regress +++
# initializing database system by copying initdb template
# using temp instance on port 65312 with PID 129700
ok 1 - test_setup 356 ms
ok 2 - create_index 1002 ms
# parallel group (2 tests): stats bufchurn
not ok 3 + stats 21176 ms
...
diff -U3 .../src/test/regress/expected/stats.out .../src/test/regress/results/stats.out
--- .../src/test/regress/expected/stats.out 2025-01-06 08:03:07.720275980 +0200
+++ .../src/test/regress/results/stats.out 2025-01-06 08:15:34.602170721 +0200
@@ -1321,7 +1321,7 @@
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
?column?
----------
- t
+ f
(1 row)
...
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-12-17%2008%3A59%3A44
Best regards,
Alexander
Attachment | Content-Type | Size |
---|---|---|
stats-bufs-write-repro.patch | text/x-patch | 8.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Zhou, Zhiguo | 2025-01-06 07:05:02 | Re: RFC: Lock-free XLog Reservation from WAL |
Previous Message | Zhou, Zhiguo | 2025-01-06 06:46:03 | Re: [RFC] Lock-free XLog Reservation from WAL |