Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>, Denis Laxalde <denis(dot)laxalde(at)dalibo(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, "Gregory Stark (as CFM)" <stark(dot)cfm(at)gmail(dot)com>, Jelte Fennema <Jelte(dot)Fennema(at)microsoft(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Date: 2024-07-16 11:00:00
Message-ID: e4d0cb33-6be5-e4d5-ae49-9eac3ff2b005@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

25.06.2024 11:24, Jelte Fennema-Nio wrote:
> My expectation is that that should remove all failure cases. If it
> doesn't, I think our best bet is removing the test again.

It looks like that test eventually showed what could be called a virtue.
Please take a look at a recent BF failure [1]:
timed out after 10800 secs
...
# +++ regress install-check in contrib/postgres_fdw +++
# using postmaster on /home/andrew/bf/root/tmp/buildfarm-e2ahpQ, port 5878

So the postgres_fdw test hanged for several hours while running on the
Cygwin animal lorikeet.

I've managed to reproduce this issue in my Cygwin environment by running
the postgres_fdw test in a loop (10 iterations are enough to get the
described effect). And what I'm seeing is that a query-cancelling backend
is stuck inside pgfdw_xact_callback() -> pgfdw_abort_cleanup() ->
pgfdw_cancel_query() -> pgfdw_cancel_query_begin() -> libpqsrv_cancel() ->
WaitLatchOrSocket() -> WaitEventSetWait() -> WaitEventSetWaitBlock() ->
poll().

The timeout value (approximately 30 seconds), which is passed to poll(),
is effectively ignored by this call — the waiting lasts for unlimited time.

This definitely is caused by 2466d6654. (I applied the test change from that
commit to 2466d6654~1 and saw no issue when running the same test in a
loop.)

With gdb attached to a hanging backend, I see the following stack trace:
#0  0x00007ffb7f70d5e4 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ffb7d2e920e in WaitForSingleObjectEx () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#2  0x00007ffb5ce78862 in fhandler_socket_wsock::evaluate_events (this=0x800126968, event_mask=50, events=(at)0x7ffffb208:
0, erase=erase(at)entry=false)
    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/fhandler/socket_inet.cc:268
#3  0x00007ffb5cdef0f5 in peek_socket (me=0xa001a43c0) at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:1771
#4  0x00007ffb5cdf211e in select_stuff::poll (this=this(at)entry=0x7ffffb300, readfds=0x7ffffb570,
readfds(at)entry=0x800000000, writefds=0x7ffffb560, writefds(at)entry=0x7ffffb5c0, exceptfds=0x7ffffb550,
    exceptfds(at)entry=0x7ffb5cdf2c97 <cygwin_select(int, fd_set*, fd_set*, fd_set*, timeval*)+71>) at
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:554
#5  0x00007ffb5cdf257e in select (maxfds=maxfds(at)entry=45, readfds=0x800000000, writefds=0x7ffffb5c0,
exceptfds=0x7ffb5cdf2c97 <cygwin_select(int, fd_set*, fd_set*, fd_set*, timeval*)+71>, us=4308570016,
    us(at)entry=29973000) at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:204
#6  0x00007ffb5cdf2927 in pselect (maxfds=45, readfds=0x7ffffb570, writefds=0x7ffffb560, exceptfds=0x7ffffb550,
to=<optimized out>, to(at)entry=0x7ffffb500, set=<optimized out>, set(at)entry=0x0)
    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:120
#7  0x00007ffb5cdf2c97 in cygwin_select (maxfds=<optimized out>, readfds=<optimized out>, writefds=<optimized out>,
exceptfds=<optimized out>, to=0x7ffffb5b0)
    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:147
#8  0x00007ffb5cddc112 in poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/poll.cc:83
...
and socket_inet.c:268 ([2]) indeed contains an infinite wait call
(LOCK_EVENTS; / WaitForSingleObject (wsock_mtx, INFINITE)).

So it looks like a Cygwin bug, but maybe something should be done on our side
too, at least to prevent such lorikeet failures.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2024-07-12%2010%3A05%3A27
[2] https://www.cygwin.com/cgit/newlib-cygwin/tree/winsup/cygwin/fhandler/socket_inet.cc?h=cygwin-3.5.3

Best regards,
Alexander

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nazir Bilal Yavuz 2024-07-16 11:11:20 Re: Use read streams in CREATE DATABASE command when the strategy is wal_log
Previous Message Rajan Pandey 2024-07-16 10:49:44 Why is 'use_alias' hardcoded to true in deparseFromExprForRel() for some cases