Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Euler Taveira <euler(dot)taveira(at)enterprisedb(dot)com>
Subject: Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
Date: 2025-02-06 08:35:02
Message-ID: Z6R0NuRGiY2_iNQ2@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 11, 2024 at 03:17:44PM -0800, Jacob Champion wrote:
> 1. pgstat_bestart_initial() reports STATE_STARTING, fills in the early
> fields and clears out the rest.
> 2. pgstat_bestart_security() reports the SSL/GSS status of the
> connection. This is only for backends with a valid MyProcPort; they
> call it twice.
> 3. pgstat_bestart_final() fills in the user and database IDs, takes
> the entry out of STATE_STARTING, and reports the application_name.

This split is interesting, neater than the previous approches taken.

> I was wondering if maybe I should fill in application_name before
> taking the entry out of STATE_STARTING, in order to establish the rule
> that "starting" pgstat entries are always partially filled, and that
> DBAs can rely on their full contents once they've proceeded past it.
> Thoughts?

/* Update app name to current GUC setting */
+ /* TODO: ask the list: maybe do this before setting STATE_UNDEFINED? */
if (application_name)
pgstat_report_appname(application_name);

Historically, we've always reported the application name after
STATE_UNDEFINED is set, never the reverse. There could be potential
security implications if we were to begin reporting the application
name before the connection has fully authenticated because it would be
possible for attempted connections to show malicious data in the
catalogs, and now we're sure that the catalog data is OK to query for
other users. Let's be careful about that. I think that we should
still set that as late as possible in pgstat_bestart_final(), never at
an earlier step as this data can come from a connection
string, potentially malicious if not authenticated yet.

> I've added machinery to 001_ssltests.pl to make sure we see early
> transport security stats prior to user authentication. This overlaps
> quite a bit with the new 007_pre_auth.pl, so if we'd rather not have
> the latter (as briefly discussed upthread) I can move the rest of it
> over.

+ if (!bootstrap)
+ {
+ pgstat_bestart_initial();
+ if (MyProcPort)
+ pgstat_bestart_security(); /* fill in any SSL/GSS info too */

This part of patch 0001 is giving me a very long pause. What's the
merit of filling in twice the backend entry data if we're going to
update it at the end anyway for normal backend connections? More info
for debugging, is it? It seems that removing this call does not
influence the tests you are adding, as well (or should the test
"pg_stat_ssl view is updated" be responsible for checking that?).
Perhaps this should be documented as a comment.

Anyway, allowing backends to call this routine multiple times makes me
quite uncomfortable, and it does not prevent looking at the wait event
data that may be reported with 0002, no? I think that we should only
call it once per backend and once authentication is completed, even
perhaps have a static rule to engrave this policy in stone.

With this patch, the information that we get to be able to debug a
backend entry in pg_stat_activity is st_clientaddr and
remote_hostname. If we have a backend stuck in a wait event for the
"Auth" class, would these be sufficient to provide a good user
experience? Still better than nothing as we don't know the database
and user ID used for the connection until authentication is done, I
guess, to be able to grab patterns behind authentication getting
stuck.

In patch 0002, WAIT_EVENT_SSPI_TRANSLATE_NAME is used twice for two
code paths. Perhaps these should be two separate wait events?
The events attached to ldap_unbind() make that also kind of hard to
debug. What's the code path actually impacted if we see them in
pg_stat_activity? We have nine of them with the same wait event
attached to them.

The patch needs some 2024 -> 2025 updates in its copyright notices.

At the end of [1], I've been reminded of this quote from Andres about
0002:
"This doesn't really seem like it's actually using wait events to
describe waits. The new wait events cover stuff like memory
allocations etc, see e.g. pg_SSPI_make_upn()."

Should we do improvements in this area with interruptions before
attempting to tackle the problem of making starting backend entries
visible in pg_stat_activity, before these complete authentication? We
use wait events for syscalls on I/O already, perhaps that does not
stand as an argument in favor of this patch on its own, but I don't
see why more events associated to external library calls as done in
0002 is a problem? Feel free to counter-argue, each code path where
an Auth event is added should be more closely looked at, for sure, but
I don't see why we should discard all of them based on this argument
if some are useful for debugging auth issues.

[1]: https://www.postgresql.org/message-id/faesruozcdcg2aksscb3dcojy4gqjqsyaxfajkqzm4kozjowgm@nmjgrrvdax25
--
Michael

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-02-06 08:39:25 Re: Show WAL write and fsync stats in pg_stat_io
Previous Message Yura Sokolov 2025-02-06 08:31:28 Re: Implement waiting for wal lsn replay: reloaded