Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

From: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Euler Taveira <euler(dot)taveira(at)enterprisedb(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>
Subject: Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
Date: 2025-03-07 18:28:16
Message-ID: CAOYmi+nkEPh4724NKci7hFiZBbu4WEO99bL7=fu+EvTAb3N8uw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 7, 2025 at 9:25 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I should have clarified - there are a few that I think are ok, basically the
> places where we wrap syscalls, e.g. around the sendto, select and recvfrom in
> PerformRadiusTransaction().

Okay.

> OTOH that code is effectively completely broken. Doing a blocking select() is
> just a no-go, the code isn't interruptible, breaking authentication
> timeout. And using select() means that we theoretically could crash due to an
> fd that's above FD_SETSIZE.

I think we're in agreement here; I'm just trying to improve things
incrementally. If someone actually hits the broken case, I think it'd
be helpful for them to see it.

> I think some of the wrapped calls into library code might actually call back
> into our code (to receive/send data), and our code then will use wait events
> around lower level operations done as part of that.

That would be a problem, agreed, but I didn't think I'd wrapped any
callback APIs. (Admittedly I have little experience with the SSPI
stuff.) But looking at the wrapped calls in the patch... which are you
suspicious of?

> It's also IMO quite wrong to do something that can throw an error inside a
> wait event, because that means that the wait event will still be reported
> during error recovery.

Hm, okay. I can change that for the LookupAccountSid case.

> Probably not the only place doing so, but it's still
> wrong.

It's definitely not the only place. :D

> Why stop with
> these functions and not just do that for *all* functions in postgres? I mean
> it'd not work and slow everything down,

(That seems like a good reason not to do it for all functions in
Postgres, no? I hope the slope is not all that slippery in practice.)

> but how do you define that line?

Cost/benefit. In this case, authentication hanging in an unknown place
in PAM and LDAP has caused tangible support problems. I suspect I'd
have gotten complaints if I only focused on those two places, though,
so I expanded it to the other blocking calls I could see.

Thanks,
--Jacob

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2025-03-07 18:41:35 Re: Statistics Import and Export
Previous Message Tom Lane 2025-03-07 18:27:30 Re: Add column name to error description