Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Liam Bowen <liambowen(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)
Date: 2022-01-21 03:56:57
Message-ID: 1105077.1642737417@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Liam Bowen <liambowen(at)gmail(dot)com> writes:
> I used the EnterpriseDB installer for several versions, and narrowed down
> the bug as being introduced between 13.5 and 14.1. Then I used git bisect
> to narrow down which revision actually introduced the bug. Each time, I
> would build libpq and copy the DLL into the same directory as my executable
> and verify that my build of libpq was being loaded. Eventually my bisection
> pointed to 52a1022.

Hmmm ....

> Here is a debugging session:
> https://gist.github.com/hut8/3b25e6a581a600589bdc62644734de18.

So as with the original bug #17299 report, this is showing an abort()
failure inside libintl, although yours is in a slightly different
place. It would be interesting to look into the libintl code and
try to understand what it's seeing as wrong. Can you get debug
data out of those stack frames?

My gut feeling right now is that the reason 52a1022 tickles this
is that it causes us to invoke gettext() in the normal, successful
PQconnect code paths. I'm too tired to go look, but I suspect that
before that patch, a non-error connection would never perform any
message translations (or at least would not do so in the most basic
cases). So that would result in us hitting gettext() much more
heavily than before.

Given that you're only seeing this when performing concurrent
connections (or at least that's how I understood what you said),
I'm eyeing libpq_binddomain with a bit of suspicion. Does it
help to move down the update of already_bound, like this?

const char *ldir;

- already_bound = true;
/* No relocatable lookup here because the binary could be anywhere */
ldir = getenv("PGLOCALEDIR");
if (!ldir)
ldir = LOCALEDIR;
bindtextdomain(PG_TEXTDOMAIN("libpq"), ldir);
+ already_bound = true;
#ifdef WIN32
SetLastError(save_errno);
#else
errno = save_errno;
#endif

My thought here is that if two threads went through this code
at about the same time, it'd be possible for one of them to
decide that the bindtextdomain call had already happened when
it had not. That could lead to that thread calling gettext
and then the other one calling bindtextdomain later than that;
and maybe libintl's response to that is as unfriendly as abort().

The above quick-hack change would result in both threads calling
bindtextdomain, which I'm guessing libintl is prepared to cope
with (at least, its docs claim bindtextdomain is "MT-Safe").
If it isn't really, then we might have to additionally add a
critical section here.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Liam Bowen 2022-01-21 05:29:02 Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)
Previous Message Liam Bowen 2022-01-21 02:00:22 Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)