Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)

From: Liam Bowen <liambowen(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)
Date: 2022-01-21 05:29:02
Message-ID: CAE7q7EiVWJ-pp76bK2wPcvSeio6o5s7OMToLiJCm1-dNd2c=SQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

I will try your fix tomorrow. That's a good idea. You asked about debug
information for the libintl stack frame. I wish this were easier, but it
looks like the Enterprise DB folks don't include a PDB file for that DLL,
although they do include them for almost every other module. In addition, I
have no way of knowing what revision or config switches they used to build
it. Any idea how I could contact someone at Enterprise DB that could help
us out here? I would really rather just get the symbols and source revision
from them, as building it myself on Windows is quite a process and may not
contain the same bug.

Thanks for a quick reply, Tom!

On Thu, Jan 20, 2022, 22:56 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Liam Bowen <liambowen(at)gmail(dot)com> writes:
> > I used the EnterpriseDB installer for several versions, and narrowed down
> > the bug as being introduced between 13.5 and 14.1. Then I used git bisect
> > to narrow down which revision actually introduced the bug. Each time, I
> > would build libpq and copy the DLL into the same directory as my
> executable
> > and verify that my build of libpq was being loaded. Eventually my
> bisection
> > pointed to 52a1022.
>
> Hmmm ....
>
> > Here is a debugging session:
> > https://gist.github.com/hut8/3b25e6a581a600589bdc62644734de18.
>
> So as with the original bug #17299 report, this is showing an abort()
> failure inside libintl, although yours is in a slightly different
> place. It would be interesting to look into the libintl code and
> try to understand what it's seeing as wrong. Can you get debug
> data out of those stack frames?
>
>
> My gut feeling right now is that the reason 52a1022 tickles this
> is that it causes us to invoke gettext() in the normal, successful
> PQconnect code paths. I'm too tired to go look, but I suspect that
> before that patch, a non-error connection would never perform any
> message translations (or at least would not do so in the most basic
> cases). So that would result in us hitting gettext() much more
> heavily than before.
>
> Given that you're only seeing this when performing concurrent
> connections (or at least that's how I understood what you said),
> I'm eyeing libpq_binddomain with a bit of suspicion. Does it
> help to move down the update of already_bound, like this?
>
> const char *ldir;
>
> - already_bound = true;
> /* No relocatable lookup here because the binary could be
> anywhere */
> ldir = getenv("PGLOCALEDIR");
> if (!ldir)
> ldir = LOCALEDIR;
> bindtextdomain(PG_TEXTDOMAIN("libpq"), ldir);
> + already_bound = true;
> #ifdef WIN32
> SetLastError(save_errno);
> #else
> errno = save_errno;
> #endif
>
> My thought here is that if two threads went through this code
> at about the same time, it'd be possible for one of them to
> decide that the bindtextdomain call had already happened when
> it had not. That could lead to that thread calling gettext
> and then the other one calling bindtextdomain later than that;
> and maybe libintl's response to that is as unfriendly as abort().
>
> The above quick-hack change would result in both threads calling
> bindtextdomain, which I'm guessing libintl is prepared to cope
> with (at least, its docs claim bindtextdomain is "MT-Safe").
> If it isn't really, then we might have to additionally add a
> critical section here.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alexander Lakhin 2022-01-21 06:00:00 Re: BUG #17372: Altering statistics during concurrent drop can lead to a server crash
Previous Message Tom Lane 2022-01-21 03:56:57 Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)