Quick Links

Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

From:	Alexander Lakhin <exclusion(at)gmail(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Date:	2022-01-14 08:00:00
Message-ID:	2174557f-ead8-430d-a07e-d92b50e69b65@gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

13.01.2022 09:36, Thomas Munro wrote:
> Here's a draft attempt at a fix. First I tried to use recv(fd, &c, 1,
> MSG_PEEK) == 0 to detect EOF, which seemed to me to be a reasonable
> enough candidate, but somehow it corrupts the stream (!?), so I used
> Alexander's POLLHUP idea, except I pushed it down to a more principled
> place IMHO. Then I suppressed it after the initial check because then
> the logic from my earlier patch takes over, so stuff like FeBeWaitSet
> doesn't suffer from extra calls, only these two paths that haven't
> been converted to long-lived WESes yet. Does this pass the test?
Yes, this fix eliminates the flakiness for me. The test commit_ts (with
002_standby and 003_standby_2) passed 2x200 iterations.
It also makes my test postgres_fdw/001_disconnection pass reliably.
> I wonder if this POLLHUP technique is reliable enough (I know that
> wouldn't work on other systems[1], which is why I was trying to make
> MSG_PEEK work...).
>
> What about environment variable PG_TEST_USE_UNIX_SOCKETS=1, does it
> reproduce with that set, and does the patch fix it? I'm hoping that
> explains some Windows CI failures from a nearby thread[2].
With PG_TEST_USE_UNIX_SOCKETS=1 the test commit_ts/002_standby fails on
the unpatched HEAD:
t/002_standby.pl .... 1/4 # poll_query_until timed out executing this query:
# SELECT '0/303C628'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Looks like your test exited with 25 just after 1.
t/002_standby.pl .... Dubious, test returned 25 (wstat 6400, 0x1900)

002_standby_primary.log contains:
2022-01-13 18:57:32.925 PST [1448] LOG: starting PostgreSQL 15devel,
compiled by Visual C++ build 1928, 64-bit
2022-01-13 18:57:32.926 PST [1448] LOG: listening on Unix socket
"C:/Users/1/AppData/Local/Temp/yOKQPH1FoO/.s.PGSQL.62733"

The same with my postgres_fdw test:
# 03:41:12.533246 result:       0
# 0|0
# 03:41:12.534758 executing query...
# 03:41:14.267594 result:       3
#
# psql:<stdin>:1: WARNING: no connection to the server
# psql:<stdin>:1: ERROR: FATAL: terminating connection due to
administrator command
# server closed the connection unexpectedly
#       This probably means the server terminated abnormally
#       before or while processing the request.
# CONTEXT: remote SQL command: FETCH 100 FROM c1
# 03:41:14.270449 executing query...
# 03:41:14.334437 result:       3
#
# psql:<stdin>:1: ERROR: could not connect to server "fpg"
# DETAIL: connection to server on socket
"C:/Users/1/AppData/Local/Temp/hJWD9mzPHM/.s.PGSQL.57414" failed:
Connection refused (0x0000274D/10061)
#       Is the server running locally and accepting connections on that
socket?
# 03:41:14.336918 executing query...
# 03:41:14.422381 result:       3
#
# psql:<stdin>:1: ERROR: could not connect to server "fpg"
# DETAIL: connection to server on socket
"C:/Users/1/AppData/Local/Temp/hJWD9mzPHM/.s.PGSQL.57414" failed:
Connection refused (0x0000274D/10061)
#       Is the server running locally and accepting connections on that
socket?
# 03:41:14.425628 executing query...
...hang...

With the patch these tests pass successfully.

I can also confirm that on Windows 10 20H2 (previous tests were
performed on Windows Server 2012) the unpatched HEAD +
PG_TEST_USE_UNIX_SOCKETS=1 hangs on src\test\recovery\001_stream_rep (on
iterations 1, 1, 4 for me).
(v9-0001-Add-option-for-amcheck-and-pg_amcheck-to-check-un.patch [1] not
required to see that.)
001_stream_rep_primary.log contains:
...
2022-01-13 19:46:48.287 PST [1364] LOG: listening on Unix socket
"C:/Users/1/AppData/Local/Temp/EWzapwiXjV/.s.PGSQL.58248"
2022-01-13 19:46:48.317 PST [6736] LOG: database system was shut down
at 2022-01-13 19:46:37 PST
2022-01-13 19:46:48.331 PST [1364] LOG: database system is ready to
accept connections
2022-01-13 19:46:49.536 PST [1088] 001_stream_rep.pl LOG: statement:
CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a
2022-01-13 19:46:49.646 PST [3028] 001_stream_rep.pl LOG: statement:
SELECT pg_current_wal_insert_lsn()
2022-01-13 19:46:49.745 PST [3360] 001_stream_rep.pl LOG: statement:
SELECT '0/3023268' <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
...
2022-01-13 19:50:39.755 PST [4924] 001_stream_rep.pl LOG: statement:
SELECT '0/3023268' <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
2022-01-13 19:50:39.928 PST [5924] 001_stream_rep.pl LOG: statement:
SELECT '0/3023268' <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';

Without PG_TEST_USE_UNIX_SOCKETS=1 and without the fix the
001_stream_rep hangs too (but on iterations 3, 8, 2). So it seems that
using unix sockets increases the fail rate.

With the fix 100 iterations with PG_TEST_USE_UNIX_SOCKETS=1 and 40 (and
still counting) iterations without PG_TEST_USE_UNIX_SOCKETS pass.

So the fix looks as absolutely working to me with the tests that we have
for now.

[1]
https://www.postgresql.org/message-id/CALT9ZEHx2%2B9rqAeAANkUXJCsTueQqdx2Tt6ypaig9tozJkWvkQ%40mail.gmail.com

In response to

Re: Why is src/test/modules/committs/t/002_standby.pl flaky? at 2022-01-13 06:36:01 from Thomas Munro

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Julien Rouhaud	2022-01-14 08:03:41	Re: MDAM techniques and Index Skip Scan patch
Previous Message	Dmitry Dolgov	2022-01-14 07:55:26	Re: MDAM techniques and Index Skip Scan patch