Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Date: 2022-01-09 11:00:00
Message-ID: c20eb022-f576-0cc5-d796-91ffaf10a2ca@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Tom,
09.01.2022 04:17, Tom Lane wrote:
>> So for some reason, on these machines detection of walsender-initiated
>> connection close is unreliable ... or maybe, the walsender didn't close
>> the connection, but is somehow still hanging around? Don't have much idea
>> where to dig beyond that, but maybe someone else will. I wonder in
>> particular if this could be related to our recent discussions about
>> whether to use shutdown(2) on Windows --- could we need to do the
>> equivalent of 6051857fc/ed52c3707 on walsender connections?
> ... wait a minute. After some more study of the buildfarm logs,
> it was brought home to me that these failures started happening
> just after 6051857fc went in:
>
> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=90&branch=&member=&stage=module-commit_tsCheck&filter=Submit
>
> The oldest matching failure is jacana's on 2021-12-03.
> (The above sweep finds an unrelated-looking failure on 2021-11-11,
> but no others before 6051857fc went in on 2021-12-02. Also, it
> looks likely that ed52c3707 on 2021-12-07 made the failure more
> probable, because jacana's is the only matching failure before 12-07.)
>
> So I'm now thinking it's highly likely that those commits are
> causing it somehow, but how?
>
I've managed to reproduce this failure too.
Removing "shutdown(MyProcPort->sock, SD_SEND);" doesn't help here, so
the culprit is exactly "closesocket(MyProcPort->sock);".
I've added `system("netstat -ano");` before die() in 002_standby.pl and see:
# Postmaster PID for node "primary" is 944
  Proto  Local Address          Foreign Address        State           PID
...
  TCP    127.0.0.1:58545        127.0.0.1:61995        FIN_WAIT_2      944
...
  TCP    127.0.0.1:61995        127.0.0.1:58545        CLOSE_WAIT      1352

(Replacing SD_SEND with SD_BOTH doesn't change the behaviour.)

Looking at the libpqwalreceiver.c:
        /* Now that we've consumed some input, try again */
        rawlen = PQgetCopyData(conn->streamConn, &conn->recvBuf, 1);
here we get -1 on the primary disconnection.
Then we get COMMAND_OK here:
        res = libpqrcv_PQgetResult(conn->streamConn);
        if (PQresultStatus(res) == PGRES_COMMAND_OK)
and finally just hang at:
            /* Verify that there are no more results. */
            res = libpqrcv_PQgetResult(conn->streamConn);
until the standby gets interrupted by the TAP test. (That call can also
return NULL and then the test completes successfully.)
Going down through the call chain, I see that at the end of it
WaitForMultipleObjects() hangs while waiting for the primary connection
socket event. So it looks like the socket, that is closed by the
primary, can get into a state unsuitable for WaitForMultipleObjects().
I tried to check the socket state with the WSAPoll() function and
discovered that it returns POLLHUP for the "problematic" socket.
The following draft addition in latch.c:
int
WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
                  long timeout, uint32 wait_event_info)
{
    int            ret = 0;
    int            rc;
    WaitEvent    event;

#ifdef WIN32
    if (wakeEvents & WL_SOCKET_MASK) {
        WSAPOLLFD pollfd;
        pollfd.fd = sock;
        pollfd.events = POLLRDNORM | POLLWRNORM;
        pollfd.revents = 0;
        int rc = WSAPoll(&pollfd, 1, 0);
        if ((rc == 1) && (pollfd.revents & POLLHUP)) {
            elog(WARNING, "WaitLatchOrSocket: A stream-oriented
connection was either disconnected or aborted.");
            return WL_SOCKET_MASK;
        }
    }
#endif

makes the test 002_stanby.pl pass (100 of 100 iterations, while without
the fix I get failures roughly on each third). I'm not sure where to
place this check, maybe it's better to move it up to
libpqrcv_PQgetResult() to minimize it's footprint or to find less
Windows-specific approach, but I'd prefer a client-side fix anyway, as
graceful closing a socket by a server seems a legitimate action.

Best regards,
Alexander

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message sergei sh. 2022-01-09 11:28:20 Re: [PATCH] reduce page overlap of GiST indexes built using sorted method
Previous Message Tom Lane 2022-01-09 07:32:33 Re: null iv parameter passed to combo_init()