Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

From: "Tels" <nospam-pg-abuse(at)bloodgate(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Michael Paquier" <michael(dot)paquier(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur
Date: 2017-05-17 17:52:49
Message-ID: 3124955edc7b555878be4e2a98b6ef66.squirrel@sm.webmail.pair.com
Lists: pgsql-hackers

Moin,

On Wed, May 17, 2017 12:34 pm, Robert Haas wrote:
> On Wed, May 17, 2017 at 3:06 AM, Tsunakawa, Takayuki
> <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com> wrote:
>> What do you think of the following cases? Don't you want to connect to
>> other servers?
>>
>> * The DBA shuts down the database. The server takes a long time to do
>> checkpointing. During the shutdown checkpoint, libpq tries to connect
>> to the server and receive an error "the database system is shutting
>> down."
>>
>> * The former primary failed and now is trying to start as a standby,
>> catching up by applying WAL. During the recovery, libpq tries to
>> connect to the server and receive an error "the database system is
>> performing recovery."
>>
>> * The database server crashed due to a bug. Unfortunately, the server
>> takes unexpectedly long time to shut down because it takes many seconds
>> to write the stats file (as you remember, Tom-san experienced 57 seconds
>> to write the stats file during regression tests.) During the stats file
>> write, libpq tries to connect to the server and receive an error "the
>> database system is shutting down."
>>
>> These are equivalent to server failure. I believe we should prioritize
>> rescuing errors during operation over detecting configuration errors.
>
> Yeah, you have a point. I'm willing to admit that we may have defined
> the behavior of the feature incorrectly, provided that you're willing
> to admit that you're proposing a definition change, not just a bug
> fix.
>
> Anybody else want to weigh in with an opinion here?

Hm, to me the feature needs to be reliable (for certain values of
reliable) to be useful.

Consider that you have X hosts (redundancy), and a lot of applications
that want a stable connection to the one that (still) works, whichever
that is.

You can then either:

1. make one primary, the other standby(s) and play DNS tricks or similar
to make it appear that there is only one working host, and have all apps
connect to the "one host" (and reconnect to it upon failure)

2. let each app try each host until it finds a working one; if the
connection breaks, retry with the next host

3. or use libpq and let it try the hosts for you.
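
For reference, #3 means handing libpq a comma-separated host list in the
connection string, which is the multi-host syntax added in PG10. A small
Python sketch of building such a string follows; the helper name
multi_host_conninfo and the host names are made up for illustration, only
the host=/port=/dbname= keywords come from libpq itself.

```python
def multi_host_conninfo(hosts, dbname, port=5432):
    """Build a libpq keyword/value connection string listing several hosts.

    Hypothetical helper for illustration; libpq tries the listed hosts
    in the order given.
    """
    host_list = ",".join(hosts)
    # One port entry per host, matching the host list position by position.
    port_list = ",".join(str(port) for _ in hosts)
    return f"host={host_list} port={port_list} dbname={dbname}"


conninfo = multi_host_conninfo(["db1.example", "db2.example"], "appdb")
print(conninfo)
```

The resulting string would be passed to whatever libpq-based driver the
app uses.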

However, if I understand it correctly, #3 only works reliably in certain
cases (e.g. host down), but not if the host is "sort of down". In that
case each app would again need code to retry different hosts until it
finds a working one, instead of letting libpq do the work.
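
To make that extra work concrete, here is a minimal Python sketch of the
retry loop each app would have to carry itself. Everything in it is
hypothetical: ConnectError, connect_first_working and fake_connect are
names I made up, and fake_connect only simulates a "sort of down" server
that answers but refuses connections. It is not how libpq behaves
internally.

```python
class ConnectError(Exception):
    """Raised by the (hypothetical) connect function when a host is unusable."""


def connect_first_working(hosts, connect):
    """Try each host in order and return (host, connection) for the first success.

    `connect` is any callable that takes a host name and either returns a
    connection object or raises ConnectError.
    """
    errors = []
    for host in hosts:
        try:
            return host, connect(host)
        except ConnectError as exc:
            # Treat *any* connection error as "try the next host", including
            # errors from a server that is up but shutting down or recovering.
            errors.append((host, str(exc)))
    raise ConnectError(f"no working host: {errors}")


def fake_connect(host):
    """Stand-in for a real driver call: db1 is 'sort of down', others work."""
    if host == "db1.example":
        raise ConnectError("the database system is shutting down")
    return f"connection to {host}"


host, conn = connect_first_working(["db1.example", "db2.example"], fake_connect)
print(host, conn)
```

The point is the except branch: the app moves on to the next host for
every kind of failure, which is what one would hope #3 does as well.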

That makes #3 hard to deploy in practice, as you might just as easily
code up #1 or #2 and call it a day.

All the best,

Tels
