Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

From: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Euler Taveira <euler(dot)taveira(at)enterprisedb(dot)com>
Subject: Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
Date: 2025-02-10 16:23:30
Message-ID: CAOYmi+=azjJ064LfZG0pJ6DBjx9Ro0Kzdd+Z2Jyyy=66cqL+8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 6, 2025 at 12:35 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> /* Update app name to current GUC setting */
> + /* TODO: ask the list: maybe do this before setting STATE_UNDEFINED? */
> if (application_name)
> pgstat_report_appname(application_name);
>
> Historically, we've always reported the application name after
> STATE_UNDEFINED is set, never the reverse. There could be potential
> security implications if we were to begin reporting the application
> name before the connection has fully authenticated because it would be
> possible for attempted connections to show malicious data in the
> catalogs, and now we're sure that the catalog data is OK to query for
> other users. Let's be careful about that. I think that we should
> still set that as late as possible in pgstat_bestart_final(), never at
> an earlier step as this data can come from a connection
> string, potentially malicious if not authenticated yet.

I don't want to change the order of authentication and application
name, just move the application name report up above the state change
in pgstat_bestart_final(). I don't feel strongly about it, though.

> + if (!bootstrap)
> + {
> + pgstat_bestart_initial();
> + if (MyProcPort)
> + pgstat_bestart_security(); /* fill in any SSL/GSS info too */
>
> This part of patch 0001 is giving me a very long pause. What's the
> merit of filling in twice the backend entry data if we're going to
> update it at the end anyway for normal backend connections? More info
> for debugging, is it?

Correct; if you already know authentication information from the
transport then it seemed nice to fill it in before doing the stuff
that hangs.

> It seems that removing this call does not
> influence the tests you are adding, as well (or should the test
> "pg_stat_ssl view is updated" be responsible for checking that?).

The test is supposed to enforce that, but I see that it's not for some
reason. That's concerning. I'll investigate, thanks for pointing it
out.

> With this patch, the information that we get to be able to debug a
> backend entry in pg_stat_activity is st_clientaddr and
> remote_hostname. If we have a backend stuck in a wait event for the
> "Auth" class, would these be sufficient to provide a good user
> experience? Still better than nothing as we don't know the database
> and user ID used for the connection until authentication is done, I
> guess, to be able to grab patterns behind authentication getting
> stuck.

It's better than nothing, but is there a particular reason not to
trust the crypto the first time around? The SSL/GSS details are what
they are; if you don't trust them at either point one or point two
then I think we need to have a more urgent conversation about that?

> In patch 0002, WAIT_EVENT_SSPI_TRANSLATE_NAME is used twice for two
> code paths. Perhaps these should be two separate wait events?

That was my question from v6:

> I violated the "one event name per call site" rule with
> TranslateName(). The call pattern there is "call once to figure out
> the buffer length, then call again to fill it in", and IMO that didn't
> deserve differentiation. But if anyone objects, I'm happy to change it
> (and I'd appreciate some name suggestions in that case).

So -- any name suggestions? :D (Personally, I viewed this sort of like
an unrolled loop of two. I don't know if it helps to know which call
is hanging as long as you know that TranslateName is hanging.)

> The events attached to ldap_unbind() make that also kind of hard to
> debug. What's the code path actually impacted if we see them in
> pg_stat_activity? We have nine of them with the same wait event
> attached to them.

Do we _want_ nine separate flavors of WAIT_EVENT_LDAP_UNBIND? I
figured it was enough to know that you were stuck unbinding.

> The patch needs some 2024 -> 2025 updates in its copyright notices.

Will do.

> At the end of [1], I've been reminded of this quote from Andres about
> 0002:
> "This doesn't really seem like it's actually using wait events to
> describe waits. The new wait events cover stuff like memory
> allocations etc, see e.g. pg_SSPI_make_upn()."
>
> Should we do improvements in this area with interruptions before
> attempting to tackle the problem of making starting backend entries
> visible in pg_stat_activity, before these complete authentication? We
> use wait events for syscalls on I/O already, perhaps that does not
> stand as an argument in favor of this patch on its own, but I don't
> see why more events associated to external library calls as done in
> 0002 is a problem?

I had hoped that my v6 addressed Andres' concerns by pushing the
events down as far as possible (that way they can't be nested, which I
took to be the primary problem). Does something else need to be done?

Thanks,
--Jacob

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2025-02-10 16:31:20 Small memory fixes for pg_createsubcriber
Previous Message David G. Johnston 2025-02-10 16:21:48 Re: describe special values in GUC descriptions more consistently