Re: [Proposal] Add foreign-server health checks infrastructure

From: Önder Kalacı <onderkalaci(at)gmail(dot)com>
To: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com" <Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com>, "zyu(at)yugabyte(dot)com" <zyu(at)yugabyte(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: [Proposal] Add foreign-server health checks infrastructure
Date: 2022-10-12 13:30:34
Message-ID: CACawEhW19nPfbFpvfke9eidFDxAy+ic36wmY0s936T=xzxgHog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Sounds reasonable. Do you mean that we can add additional GUC like
> "postgres_fdw.initial_check",
> wait WL_SOCKET_CLOSED if the conneciton is found in the hash table, and do
> reconnection if it might be closed, right?
>
>
Alright, it took me sometime to realize that postgres_fdw already has a
retry mechanism if the first command fails: postgres_fdw: reestablish new
connection if cached one is detected as… · postgres/postgres(at)32a9c0b
(github.com)
<https://github.com/postgres/postgres/commit/32a9c0bdf493cf5fc029ab44a22384d547290eff>

Still, the reestablish mechanism can be further simplified with
WL_SOCKET_CLOSED event such as the following (where we should probably
rename pgfdw_connection_check_internal):

/*
> * If the connection needs to be remade due to invalidation, disconnect as
> * soon as we're out of all transactions.
> */
>
* | +bool remoteSocketIsClosed = entry->conn != NULL : *
pgfdw_connection_check_internal*(entry->conn) : false;*

> if (entry->conn != NULL && (entry->invalidated || remoteSocketIsClosed) &&
> entry->xact_depth == 0)

{
> elog(DEBUG3, "closing connection %p for option changes to take effect",
> entry->conn);
> disconnect_pg_server(entry);
> }

* | +else if (remoteSocketIsClosed && && entry->xact_depth > 0)*
* | + error ("Remote Server is down ...")*

In other words, a variation of pgfdw_connection_check_internal()
could potentially go into interfaces/libpq/libpq-fe.h
(backend/libpq/pqcomm.c or src/interfaces/libpq/fe-connect.c).
Then, GetConnection() in postgres_fdw, it can force to reconnect as it is
already done for some cases or error properly:

>
> Based on off and on discussions, I modified my patch.
>
>
I still think that it is probably too much work/code to detect the
mentioned use-case you described on [1]. Each backend constantly
calling CallCheckingRemoteServersCallbacks() for this purpose doesn't sound
the optimal way to approach the "check whether server down" problem. You
typically try to decide whether a server is down by establishing a
connection (or ping etc), not going over all the existing connections.

As far as I can think of, it should probably be a single background task
checking whether the server is down. If so, sending an invalidation message
to all the backends such that related backends could act on the
invalidation and throw an error. This is to cover the use-case you
described on [1].

Also, maybe we could have a new catalog table like pg_foreign_server_health
or such, where we can keep the last time the check succeeded (and/or
failed), and how many times the check succeeded (and/or failed).

This is of course how I would approach this problem. I think some other
perspectives on this would be very useful to hear.

Thanks,
Onder KALACI

[1]
https://www.postgresql.org/message-id/TYAPR01MB58662809E678253B90E82CE5F5889%40TYAPR01MB5866.jpnprd01.prod.outlook.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhang Mingli 2022-10-12 13:39:11 Re: Issue in GIN fast-insert: XLogBeginInsert + Read/LockBuffer ordering
Previous Message Pavel Stehule 2022-10-12 13:26:46 Re: Schema variables - new implementation for Postgres 15