Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>, Denis Laxalde <denis(dot)laxalde(at)dalibo(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, "Gregory Stark (as CFM)" <stark(dot)cfm(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Date: 2024-11-21 01:30:54
Message-ID: 321476.1732152654@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
> 17.11.2024 05:33, Tom Lane wrote:
>> Yeah. This has been happening off-and-on in the buildfarm ever
>> since we added that test. I'm not sure if it's just "the test
>> is unstable" or if it's telling us there's a problem with the
>> cancel logic. Scraping the last 3 months worth of buildfarm
>> logs finds these instances:

> Yes, I counted those bf failures at [1] too and posted my explanation
> upthread [2].

Sorry, I'd forgotten about that. I added some more debug logging
to the modifications you made, and confirmed your theory that the
remote session is ignoring the cancel request because it receives it
while DoingCommandRead is true; which must mean that it hasn't started
the slow query yet.

This implies that the 100ms delay in query_cancel.sql is not reliably
enough for the remote to receive the command, which surprises me,
especially since the failing animals aren't particularly slow ones.
Maybe there is something else happening? But I do reproduce the
failure after adding your delays, and the patch I'm about to propose
does fix it.

Anyway, given that info, Jelte's unapplied 0002 patch earlier in the
thread is not the answer, because this is about dropping a query
cancel not about losing a timeout interrupt. The equivalent thing
to what he suggested would be to not clear the cancel request flag
during DoingCommandRead, instead letting it kill the next query.
But I didn't like the idea for timeouts, and I like it even less for
query cancel. What I think we should do instead is to re-issue
the cancel request if we've waited a little and nothing came of it.
This corresponds more or less to what a human user would likely do
(or at least this human would). The attached patch is set up to
re-cancel after 1 second, then 2 more seconds, then 4 more, etc
until we reach the 30-second "it's dead Jim" threshold.

This seems to fix the problem here. Thoughts?

BTW, while I didn't do it in the attached, I'm tempted to greatly
reduce the 100ms delay in query_cancel.sql. If this does make it
more robust, we shouldn't need that much time anymore.

regards, tom lane

Attachment Content-Type Size
reissue-cancel-requests.patch text/x-diff 7.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2024-11-21 02:15:56 Re: 039_end_of_wal: error in "xl_tot_len zero" test
Previous Message Thomas Munro 2024-11-21 01:24:26 Re: ci: Macos failures due to MacPorts behaviour change