Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

From: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Euler Taveira <euler(dot)taveira(at)enterprisedb(dot)com>
Subject: Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
Date: 2024-11-07 18:44:25
Message-ID: CAOYmi+kUkCkSV13-a0b0rwmtge_uZnPKQx-yLFQJdfoceWFH2g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 7, 2024 at 10:12 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I don't understand why the pgstat_bestart()/pgstat_bestart_pre_auth() split
> makes sense. The latter is going to redo most of the work that the former
> did. What's the point of that?
>
> Why not have a new function that initializes just the missing additional
> information? Or for that matter, why not move most of what pgstat_bestart()
> does into pgstat_beinit()?

I talk about that up above [1]. I agree that this is all complicated
and fragile, but at the moment, I think splitting things apart is not
going to reduce the complexity in any way. I'm all ears for a
different approach, though (and it sounds like Michael is taking a
stab at it too).

> This doesn't really seem like it's actually using wait events to describe
> waits. The new wait events cover stuff like memory allocations etc, see
> e.g. pg_SSPI_make_upn().

I've also asked about the "scope" of the waits in the OP [2]. I can
move them downwards in the stack, if you'd prefer.

All of these are intended to cover parts of the code that can actually
hang, but for things like SSPI I'm just working off of inspection and
Win32 documentation. So if it's not actually true that some of these
call points can hang, let me know and I can remove them. (For the
particular example you called out, I'm just trying to cover both calls
to TranslateName() in a maintainable place. The documentation says
"TranslateName fails if it cannot bind to Active Directory on a domain
controller." which seemed pretty wait-worthy to me.)

> This isn't just pedantry - all the relevant code really needs to be rewritten
> to allow the blocking to happen in an interruptible way, otherwise
> authentication timeout etc can't realiably work. Once that's done you can
> actually define useful wait events too.

I agree that would be amazing! I'm not about to tackle reliable
interrupts for all of the current blocking auth code for v18, though.
I'm just trying to make it observable when we do something that
blocks.

--Jacob

[1] https://www.postgresql.org/message-id/CAOYmi%2BkLzSWrDHZbJg8bWZ94oP_K98mkoEvetgupOBVoy5H_ag%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAOYmi%2B%3D60deN20WDyCoHCiecgivJxr%3D98s7s7-C8SkXwrCfHXg%40mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-11-07 19:00:03 Re: magical eref alias names
Previous Message Tomas Vondra 2024-11-07 18:39:55 Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4