Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>, Denis Laxalde <denis(dot)laxalde(at)dalibo(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, "Gregory Stark (as CFM)" <stark(dot)cfm(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Date: 2024-11-21 21:33:06
Message-ID: CAGECzQQFvsuBcT1Bu3eDWDWuxt=GT0MsGJJKK_5og51Rk-4q7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 21 Nov 2024 at 02:31, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Anyway, given that info, Jelte's unapplied 0002 patch earlier in the
> thread is not the answer, because this is about dropping a query
> cancel not about losing a timeout interrupt.

Agreed that 0002 does not fix the issue re-reported by Andres (let's
call this issue number 1). But I'm still of the opinion that 0002
fixes a real bug: i.e. a bug which causes timeouts.spec to randomly
fail[1] (let's call this issue number 2).
> This seems to fix the problem here. Thoughts?

Overall, a good approach to fix issue number 1. I think it would be
best if this was integrated into libpqsrv_cancel instead though. That
way the dblink would benefit from it too.

nit: Maybe call it RETRY_CANCEL_TIME. The RE_ prefix wasn't instantly
obvious what it meant to me, it seemed like an abbreviation when I first saw it.

> BTW, while I didn't do it in the attached, I'm tempted to greatly
> reduce the 100ms delay in query_cancel.sql. If this does make it
> more robust, we shouldn't need that much time anymore.

Seems sensible to me.

Finally there's a third issue[2] (let's call this issue number 3).
Alexander did some investigation into this issue too[3]. For this one
I have a hard time understanding what is going on, or at least how
this issue only seems to apply to cancel connections. From his
description of the problem and my reading of the code it seems that if
we fail to send the StartupPacket/CancelRequest due to a socket error,
we set the write_failed flag. But we don't actually check this flag
during the CONNECTION_AWAITING_RESPONSE phase of PQconnectPoll, so we
just wait until we reach a timeout because the server never sends us
anything.

[1]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#timeouts.spec_failed_because_of_statement_cancelled_due_to_unexpected_reason
[2]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#dblink.sql_.28and_postgres_fdw.sql.29_fail_on_Windows_due_to_the_cancel_packet_not_sent
[3]: https://www.postgresql.org/message-id/5ea25e4d-1ee2-b9bf-7806-119ffa658826@gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joe Conway 2024-11-21 21:39:58 Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
Previous Message yuansong 2024-11-21 21:26:53 Re:Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST