stats.sql might fail due to shared buffers also used by parallel tests

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: stats.sql might fail due to shared buffers also used by parallel tests
Date: 2025-01-06 07:00:00
Message-ID: fe0391a8-dfa9-41c3-bf1c-7ea058e40f30@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

I'd like to share my investigation of one mysterious stats.sql failure
occurred in December: [1].
The difference of the failure is:
--- C:/prog/bf/root/HEAD/pgsql/src/test/regress/expected/stats.out 2024-09-18 19:31:14.665516500 +0000
+++ C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2024-12-17
09:57:08.944588500 +0000
@@ -1291,7 +1291,7 @@
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
  ?column?
 ----------
- t
+ f
 (1 row)

027_stream_regress_primary.log contains:
2024-12-17 09:57:06.778 UTC [8568:109] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
      FROM pg_stat_io
      WHERE object = 'relation'
2024-12-17 09:57:06.779 UTC [8568:110] pg_regress/stats LOG: statement: CREATE TABLE test_io_shared(a int);
2024-12-17 09:57:06.780 UTC [8568:111] pg_regress/stats LOG: statement: INSERT INTO test_io_shared SELECT i FROM
generate_series(1,100)i;
...
2024-12-17 09:57:06.782 UTC [8568:115] pg_regress/stats LOG: statement: CHECKPOINT;
2024-12-17 09:57:06.794 UTC [3664:40] LOG:  checkpoint starting: immediate force wait
2024-12-17 09:57:06.856 UTC [3664:41] LOG:  checkpoint complete: wrote 0 buffers (0.0%), wrote 1 SLRU buffers; 0 WAL
file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.062 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=1875 kB, estimate=52682 kB; lsn=0/14A2F410, redo lsn=0/14A2F3B8
2024-12-17 09:57:06.857 UTC [8568:116] pg_regress/stats LOG: statement: CHECKPOINT;
2024-12-17 09:57:06.857 UTC [3664:42] LOG:  checkpoint starting: immediate force wait
2024-12-17 09:57:06.859 UTC [3664:43] LOG:  checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL
file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=0 kB, estimate=47414 kB; lsn=0/14A2F4E0, redo lsn=0/14A2F488
2024-12-17 09:57:06.859 UTC [8568:117] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
      FROM pg_stat_io
      WHERE object = 'relation'
2024-12-17 09:57:06.860 UTC [8568:118] pg_regress/stats LOG: statement: SELECT 77693 > 77693;

The corresponding test case (added by 10a082bf7, dated 2023-02-11) is:
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
  FROM pg_stat_io
  WHERE object = 'relation' \gset io_sum_shared_before_
...
CREATE TABLE test_io_shared(a int);
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
...
CHECKPOINT;
CHECKPOINT;
...
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
  FROM pg_stat_get_backend_io(pg_backend_pid())
  WHERE object = 'relation' \gset my_io_sum_shared_after_
SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;

So, as we can see from the log above, both checkpoints wrote 0 buffers,
moreover, no other process/backend wrote any buffers.

After series of experiments, I found that this situation is possible when
other backend steals shared buffers from checkpointer, without making new
buffers dirty, and then delays flushing it's statistics.

Please find attached a reproducer for this issue: I've added two sleeps
to stats.sql and reduced parallel_schedule, but I believe it's a legal
change, which doesn't affect the test case essentially.

With bufchurn.sql running concurrently. and TEMP_CONFIG containing
shared_buffers = 1MB (as in the 027_stream_regress test in question), I can
see exactly the same failure:
$ echo 'shared_buffers = 1MB' >/tmp/extra.config; TEMP_CONFIG=/tmp/extra.config make -s check
# +++ regress check in src/test/regress +++
# initializing database system by copying initdb template
# using temp instance on port 65312 with PID 129700
ok 1         - test_setup                                356 ms
ok 2         - create_index                             1002 ms
# parallel group (2 tests):  stats bufchurn
not ok 3     + stats                                   21176 ms
...

diff -U3 .../src/test/regress/expected/stats.out .../src/test/regress/results/stats.out
--- .../src/test/regress/expected/stats.out  2025-01-06 08:03:07.720275980 +0200
+++ .../src/test/regress/results/stats.out   2025-01-06 08:15:34.602170721 +0200
@@ -1321,7 +1321,7 @@
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
  ?column?
 ----------
- t
+ f
 (1 row)
...

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-12-17%2008%3A59%3A44

Best regards,
Alexander

Attachment Content-Type Size
stats-bufs-write-repro.patch text/x-patch 8.8 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhou, Zhiguo 2025-01-06 07:05:02 Re: RFC: Lock-free XLog Reservation from WAL
Previous Message Zhou, Zhiguo 2025-01-06 06:46:03 Re: [RFC] Lock-free XLog Reservation from WAL