Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Euler Taveira <euler(dot)taveira(at)enterprisedb(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>
Subject: Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
Date: 2025-03-13 17:56:42
Message-ID: rkej7nbahcoeaonn3dxdbk2wzsi2gi3l75mse7txzigleggk3c@egrki7isobok
Lists: pgsql-hackers

On 2025-03-13 10:29:49 -0700, Jacob Champion wrote:
> On Thu, Mar 13, 2025 at 9:56 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I am wondering if PAM is so fundamentally incompatible with handling
> > interrupts / a non-blocking interface that we have little choice but to
> > eventually remove it...
>
> Given the choice between a usually-working PAM module with known
> architectural flaws, and not having PAM at all, I think many users
> would rather continue using what's working for them.

authentication_timeout currently doesn't work reliably in some auth
methods, nor does pg_terminate_backend() etc. That's IMO rather bad from a
DOSability perspective.

The fact that some auth methods are broken like that has had a sizable
negative impact on postgres for a long time. Not just when those methods are
used, but also architecturally.

It's e.g. one of the main reasons we need the ugly escalating logic in
postmaster shutdowns to send SIGQUITs and then SIGKILL after a while, because
we don't have a reliable way of terminating backends normally. This used to
be way worse because historically postgres considered it sane (why, I have no
idea) to ereport() in timeout functions, which then occasionally led to
backends stuck in malloc locks etc.

> > FWIW, I continue to think that it's better to invest in making more auth
> > methods non-blocking, rather than adding wait events for code that could maybe
> > sometimes wait on different things internally.
>
> I think we disagree on the either/or nature of that. If I can get
> proof that a certain thing is causing bugs in the wild, then I have
> ammunition to fix that thing.

FWIW, I have repeatedly seen production issues due to authentication
timeout not working for some auth methods.

It's not hard to see why - e.g. a non-responsive RADIUS server just leaves the
backend hanging in select(). Even though it would get interrupted by signals,
we'll just retry without even checking interrupts / timeouts :(.
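
To make the contrast concrete, here is a minimal sketch of a wait loop that
re-checks a deadline and a cancellation flag on every wakeup instead of
blindly retrying. It is written in Python purely for illustration and is not
PostgreSQL code; wait_readable, cancel_requested and poll_interval are
hypothetical names, and cancel_requested only stands in for whatever
interrupt check the real code would perform.

```python
import select
import socket
import time


def wait_readable(sock, deadline, cancel_requested=lambda: False,
                  poll_interval=0.05):
    """Wait until sock is readable, the deadline passes, or a cancel is seen.

    Hypothetical helper: 'deadline' is an absolute time.monotonic() value,
    'cancel_requested' is a stand-in for an interrupt check.
    """
    while True:
        # Check for a pending cancel before every wait, not only at entry.
        if cancel_requested():
            return "cancelled"

        # Recompute the remaining time so repeated wakeups cannot extend
        # the total wait past the deadline.
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "timeout"

        # Sleep in short slices so the checks above run regularly even if
        # the peer never answers.
        readable, _, _ = select.select([sock], [], [],
                                       min(remaining, poll_interval))
        if readable:
            return "ready"


if __name__ == "__main__":
    # A socket pair whose other end stays silent plays the role of the
    # unresponsive server.
    ours, theirs = socket.socketpair()
    print(wait_readable(ours, time.monotonic() + 0.2))
    ours.close()
    theirs.close()
```

The point of the sketch is only the loop shape: each wakeup goes through
the cancel and deadline checks, so neither a silent peer nor a stream of
signals can keep the caller waiting indefinitely.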

> Right now there is no visibility, and my interest in rewriting old
> authentication methods without bug reports to motivate that work is pretty
> low. I'm not willing to sign up for that at the moment.

Fair enough.

> (But I do really appreciate the review. I'm just feeling crispy about
> the overall result...)

Also fair enough :)

Greetings,

Andres Freund
