Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur
Date: 2017-05-15 17:54:57
Message-ID: CA+TgmoaPNOqtwOmXF-dNSBLvTBBdMouycKb2UxiJRRQu3134=g@mail.gmail.com
Lists: pgsql-hackers

On Sun, May 14, 2017 at 9:50 PM, Tsunakawa, Takayuki
<tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com> wrote:
>> I guess not as well. That would be tricky for the user to have a different
>> behavior depending on the error returned by the server, which is why the
>> current code is doing things right IMO. Now, the feature has been designed
>> similarly to JDBC with its parametrization, so it could be surprising for
>> users to get a different failure handling compared to that. Not saying that
>> JDBC is doing it wrong, but libpq does nothing wrong either.
>
> I didn't intend to make the user have a different behavior depending on the error returned by the server. I meant attempting connection to alternative hosts when the server returned an error. I thought the new libpq feature tries to connect to other hosts when a connection attempt fails, where the "connection" is the *database connection* (user's perspective), not the *socket connection* (PG developer's perspective). I think PgJDBC meets the user's desire better -- "Please connect to some host for better HA if a database server is unavailable for some reason."
>
> By the way, could you elaborate on what problem could occur if my solution is applied? (It doesn't seem easy for me to imagine...)

Sure. Imagine that the user thinks that 'foo' and 'bar' are the
relevant database servers for some service and writes 'dbname=quux
host=foo,bar' as a connection string. However, actually the user has
made a mistake and 'foo' is supporting some other service entirely; it
has no database 'quux'; the database servers which have database
'quux' are in fact 'bar' and 'baz'. All appears well as long as 'bar'
remains up, because the missing-database error for 'foo' is ignored
and we just connect to 'bar'. However, when 'bar' goes down, we are
out of service instead of failing over to 'baz' as we should have
done.
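
The scenario can be modeled in a few lines. This is a toy Python
simulation, not libpq code: the SERVERS table, attempt(), and
connect_any_error() are names made up for illustration, and the policy
shown is the proposed "move on after any error" behavior.

```python
# Toy model of the scenario above; not libpq code. All names are hypothetical.

# Which hosts are up, and which databases each one serves.
SERVERS = {
    "foo": {"up": True, "databases": {"other"}},  # wrong service, no 'quux'
    "bar": {"up": True, "databases": {"quux"}},
    "baz": {"up": True, "databases": {"quux"}},   # never listed by the user
}


def attempt(host, dbname):
    """Return None on success, or a short error string on failure."""
    server = SERVERS[host]
    if not server["up"]:
        return "connection refused"
    if dbname not in server["databases"]:
        return f'database "{dbname}" does not exist'
    return None


def connect_any_error(hosts, dbname):
    """Proposed policy: move to the next host after any failure at all."""
    errors = []
    for host in hosts:
        err = attempt(host, dbname)
        if err is None:
            return host, errors  # the caller sees success; earlier errors are easy to miss
        errors.append((host, err))
    return None, errors


# The user's mistaken host list: 'foo,bar' instead of 'bar,baz'.
hosts = ["foo", "bar"]

connected, _ = connect_any_error(hosts, "quux")
print(connected)  # 'bar' is up, so the mistake goes unnoticed

SERVERS["bar"]["up"] = False
connected, errors = connect_any_error(hosts, "quux")
print(connected, errors)  # 'bar' is down: outage, and 'baz' is never tried
```

While 'bar' is up, the first call succeeds and nothing points at the
bad entry; once 'bar' is down, the second call fails even though 'baz'
could have served the request.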

Now it's quite possible that the user, if they test carefully, might
realize that things are not working as intended, because the DBA might
say "hey, all of your connections are being directed to 'bar' instead
of being load-balanced properly!". But even if they are careful
enough to realize this, it may not be clear what has gone wrong.
Under your proposal, the connection to 'foo' could be failing for *any
reason whatsoever*, from lack of connectivity to a missing database, a
missing user, a missing CONNECT privilege, or an authentication
failure. If the user looks at the server log and can pick out the
entries from their own connection attempts they can figure it out, but
otherwise they might spend quite a bit of time wondering what's wrong;
after all, libpq will report no error, as long as the connection to
the other server works.

Now, this is all arguable. You could certainly say -- and you are
saying -- that this feature ought to be defined to retry after any
kind of failure whatsoever. But I think what Tom and Michael and I
are saying is that this is a failover feature and therefore ought to
try the next server when the first one in the list appears to have
gone down, but not when the first one in the list is unhappy with the
connection request for some other reason. Who is right is a judgement
call, but I don't think it's self-evident that users want to ignore
anything and everything that might have gone wrong with the connection
to the first server, rather than only those things which resemble a
down server. It seems quite possible to me that if we had defined it
as you are proposing, somebody would now be arguing for a behavior
change in the other direction.
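
For comparison, the "failover only when the server looks down" position
can be sketched as follows. Again this is an illustrative Python model
under stated assumptions, not libpq's implementation: the DOWN/REJECTED
error kinds and every function name are hypothetical, and exactly which
real errors would fall into which kind is not modeled here.

```python
# Sketch of the "failover only when the server looks down" policy.
# Not libpq code; the error kinds and function names are hypothetical.

DOWN = "down"          # e.g. connection refused, timeout, host unreachable
REJECTED = "rejected"  # e.g. missing database, missing user, auth failure


def connect_failover_only(hosts, attempt):
    """Try hosts in order, moving on only when a host appears to be down.

    `attempt(host)` returns None on success or a (kind, message) tuple.
    Returns (connected_host, None) or (None, (failed_host, message)).
    """
    last_error = None
    for host in hosts:
        result = attempt(host)
        if result is None:
            return host, None
        kind, message = result
        if kind != DOWN:
            # The server answered but refused the request: report that now
            # instead of hiding it behind a later host.
            return None, (host, message)
        last_error = (host, message)
    return None, last_error


def rejected_first(host):
    """'foo' is up but has no database 'quux'; 'bar' would accept."""
    return {"foo": (REJECTED, 'database "quux" does not exist'), "bar": None}[host]


def down_first(host):
    """'foo' is unreachable; 'bar' accepts."""
    return {"foo": (DOWN, "connection refused"), "bar": None}[host]


print(connect_failover_only(["foo", "bar"], rejected_first))
# (None, ('foo', 'database "quux" does not exist'))
print(connect_failover_only(["foo", "bar"], down_first))
# ('bar', None)
```

In this model the misconfigured 'foo' entry surfaces as an error on the
very first connection attempt, while a genuinely down 'foo' still fails
over to 'bar'.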

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
