Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>, Denis Laxalde <denis(dot)laxalde(at)dalibo(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, "Gregory Stark (as CFM)" <stark(dot)cfm(at)gmail(dot)com>, Jelte Fennema <Jelte(dot)Fennema(at)microsoft(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Date: 2024-06-22 15:00:01
Message-ID: 4db099c8-4a52-3cc4-e970-14539a319466@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

30.03.2024 01:17, Noah Misch wrote:
> On Fri, Mar 29, 2024 at 09:17:55AM +0100, Jelte Fennema-Nio wrote:
>> On Thu, 28 Mar 2024 at 19:03, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Could we make this test bulletproof by using an injection point?
>>> If not, I remain of the opinion that we're better off without it.
>> Possibly, and if so, I agree that would be better than the currently
>> added test. But I honestly don't feel like spending the time on
>> creating such a test.
> The SQL test is more representative of real applications, and it's way simpler
> to understand. In general, I prefer 6-line SQL tests that catch a problem 10%
> of the time over injection point tests that catch it 100% of the time. For
> low detection rate to be exciting, it needs to be low enough to have a serious
> chance of all buildfarm members reporting green for the bad commit. With ~115
> buildfarm members running in the last day, 0.1% detection rate would have been
> low enough to bother improving, but 4% would be high enough to call it good.

As a recent buildfarm failure on orlingo (which tests asan-enabled builds)
[1] shows, that test can still fail:
70/70 postgresql:postgres_fdw-running / postgres_fdw-running/regress               ERROR           278.67s exit status 1

@@ -2775,6 +2775,7 @@
 SET LOCAL statement_timeout = '10ms';
 select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
 ERROR:  canceling statement due to statement timeout
+WARNING:  could not get result of cancel request due to timeout
 COMMIT;

(from the next run we can see normal duration:
"postgres_fdw-running/regress               OK                6.30s ")

I reproduced the failure with an asan-enabled build on a slowed-down VM
and as far as I can see, it's caused by the following condition in
ProcessInterrupts():
                /*
                 * If we are reading a command from the client, just ignore the cancel
                 * request --- sending an extra error message won't accomplish
                 * anything.  Otherwise, go ahead and throw the error.
                 */
                if (!DoingCommandRead)
                {
                        LockErrorCleanup();
                        ereport(ERROR,
(errcode(ERRCODE_QUERY_CANCELED),
                                         errmsg("canceling statement due to user request")));
                }

I think this failure can be reproduced easily (without asan/slowing down)
with this modification:
@@ -4630,6 +4630,7 @@ PostgresMain(const char *dbname, const char *username)
             idle_session_timeout_enabled = false;
         }

+if (rand() % 10 == 0) pg_usleep(10000);
         /*
          * (5) disable async signal conditions again.
          *

Running this test in a loop (for ((i=1;i<=100;i++)); do \
echo "iteration $i"; make -s check -C contrib/postgres_fdw/ || break; \
done), I get:
...
iteration 56
# +++ regress check in contrib/postgres_fdw +++
# initializing database system by copying initdb template
# using temp instance on port 55312 with PID 991332
ok 1         - postgres_fdw                            20093 ms
1..1
# All 1 tests passed.
iteration 57
# +++ regress check in contrib/postgres_fdw +++
# initializing database system by copying initdb template
# using temp instance on port 55312 with PID 992152
not ok 1     - postgres_fdw                            62064 ms
1..1
...
--- .../contrib/postgres_fdw/expected/postgres_fdw.out 2024-06-22 02:52:42.991574907 +0000
+++ .../contrib/postgres_fdw/results/postgres_fdw.out 2024-06-22 14:43:43.949552927 +0000
@@ -2775,6 +2775,7 @@
 SET LOCAL statement_timeout = '10ms';
 select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
 ERROR:  canceling statement due to statement timeout
+WARNING:  could not get result of cancel request due to timeout
 COMMIT;

I also came across another failure of the test:
@@ -2774,7 +2774,7 @@
 BEGIN;
 SET LOCAL statement_timeout = '10ms';
 select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
-ERROR:  canceling statement due to statement timeout
+ERROR:  canceling statement due to user request
 COMMIT;

which is reproduced with a sleep added here:
@@ -1065,6 +1065,7 @@ exec_simple_query(const char *query_string)
      */
     parsetree_list = pg_parse_query(query_string);
+pg_usleep(11000);

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-06-20%2009%3A52%3A04

Best regards,
Alexander

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-06-22 15:04:11 Re: replace strtok()
Previous Message Peter Geoghegan 2024-06-22 14:53:31 Re: FreezeLimit underflows in pg14 and 15 causing incorrect behavior in heap_prepare_freeze_tuple