Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Fujii Masao <fujii(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect
Date: 2020-10-11 00:16:16
Message-ID: 2943611.1602375376@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> writes:
>>> Therefore, the easy fix is to make libpq mark the connection as
>>> CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET.

So in the wake of commit fe27009cb, this is what lorikeet is doing:

@@ -9028,9 +9028,7 @@
CALL terminate_backend_and_wait('fdw_retry_check');
SAVEPOINT s;
SELECT 1 FROM ft1 LIMIT 1; -- should fail
-ERROR: server closed the connection unexpectedly
- This probably means the server terminated abnormally
- before or while processing the request.
+ERROR: could not receive data from server: Software caused connection abort
CONTEXT: remote SQL command: SAVEPOINT s2
COMMIT;
-- Clean up

which is better --- the connection recovery is working --- but
obviously still not a "pass".

The only way to make this test pass as-is is to hack libpq so that
it deems ECONNABORTED to be a server crash, which frankly I do not
think is a good change. I don't see any principled reason to think
that it means that rather than a network connectivity failure.

What I've done instead as a stopgap is to adjust the test case to be
insensitive to the exact error message text. Maybe there's a better
way, but I can't think of anything besides having two (or more?)
expected-output files. That would be quite unpalatable as things
stand, though perhaps we could make it tolerable by splitting this
test off into a second test script.

Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very
happy with it:

* The control flow seems rather forced. I think it was designed
on the assumption that reindenting the existing code is forbidden.
Why not use an actual loop, instead of a goto? I also think that
it's far from obvious that the loop isn't an infinite loop; it
took me quite a while to glom onto the fact that the test inside the
PG_CATCH avoids that by checking retry_conn. Maybe a comment would
be enough to improve that, but I feel the control logic could stand
a rethink.

* The CATCH case is completely failing to clean up after itself.
At the minimum, there has to be a FlushErrorState() there.
I don't think it'd be terribly hard to drive this code to an
error-stack-overflow PANIC.

* As is so often true of proposed patches in which PG_CATCH is
thought to be an adequate way to catch an error, this is just
unbelievably fragile. It will break, and badly so, if it catches
an error that is anything other than what it is expecting ...
and it's not even particularly trying to verify that the error is
what it's expecting. It might be okay, or at least a little bit
harder to break, if it verified that the error's SQLSTATE is
ERRCODE_CONNECTION_FAILURE.

regards, tom lane

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Noah Misch 2020-10-11 05:10:43 Re: powerpc pg_atomic_compare_exchange_u32_impl: error: comparison of integer expressions of different signedness (Re: pgsql: For all ppc compilers, implement compare_exchange and) fetch_add
Previous Message Tom Lane 2020-10-10 23:57:32 pgsql: Band-aid new postgres_fdw test case to remove error text depende