Re: Refactoring postmaster's code to cleanup after child exit

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactoring postmaster's code to cleanup after child exit
Date: 2025-03-04 22:58:42
Message-ID: gdojfusnbe3ae47n6qjezclpv4462xbdc2ssoadtyklgdw5dqb@b4r5yw3fcifm
Lists: pgsql-hackers

Hi,

On 2024-12-10 12:00:12 +0200, Heikki Linnakangas wrote:
> On 09/12/2024 22:55, Heikki Linnakangas wrote:
> > Not sure how to fix this. A small sleep in the test would work, but in
> > principle there's no delay that's guaranteed to be enough. A more robust
> > solution would be to run a "select count(*) from pg_stat_activity" and
> > wait until the number of connections are what's expected. I'll try that
> > and see how complicated that gets..
>
> Checking pg_stat_activity doesn't help, because the backend doesn't register
> itself in pg_stat_activity until later. A connection that's rejected due to
> connection limits never shows up in pg_stat_activity.
>
> Some options:
>
> 0. Do nothing
>
> 1. Add a small sleep to the test
>
> 2. Move the pgstat_bestart() call earlier in the startup sequence, so that a
> backend shows up in pg_stat_activity before it acquires a PGPROC entry, and
> stays visible until after it has released its PGPROC entry. This would give
> more visibility to backends that are starting up.

We don't necessarily *have* a PGPROC entry for that backend when we run out of
connections, no?

> 3. Rearrange the FATAL error handling so that the process removes itself
> from PGPROC before sending the error to the client. That would be kind of
> nice anyway. Currently, if sending the rejection error message to the client
> blocks, you are holding up a PGPROC slot until the message is sent. The
> error message packet is short, so it's highly unlikely to block, but still.

This is definitely a problem; there was even a recent thread about it. It can
be triggered even with just an ERROR message though :(

For this test, could we perhaps rely on the messages the postmaster logs when
child processes exit?

2025-03-04 17:56:12.528 EST [3509838][not initialized][:0][[unknown]] LOG: connection received: host=[local]
2025-03-04 17:56:12.528 EST [3509838][client backend][:0][[unknown]] FATAL: sorry, too many clients already
2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG: releasing pm child slot 2
2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG: client backend (PID 3509838) exited with exit code 1

I.e. the test could wait for the 'client backend exited' message using
->wait_for_log()?
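
To make the idea concrete, here is a rough, self-contained sketch of the polling technique: read the log from a saved offset and retry until a pattern matches or a timeout expires. It is only an illustration in Python, not the TAP framework's wait_for_log() implementation; the function name, its parameters, and the EXIT_RE pattern are assumptions modelled on the DEBUG line in the excerpt above.

```python
import re
import time

# Assumed pattern, based on the postmaster DEBUG line quoted in the log excerpt.
# Group 1 is the child PID, group 2 is its exit code.
EXIT_RE = re.compile(r"client backend \(PID (\d+)\) exited with exit code (\d+)")


def wait_for_log(path, pattern, offset=0, timeout=5.0, interval=0.1):
    """Poll the file at `path` until `pattern` matches text past `offset`.

    Returns (match, new_offset), where new_offset is the byte position just
    past the data that was read. Raises TimeoutError if nothing matches
    within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Read in binary mode so that `offset` is always a byte position.
        with open(path, "rb") as f:
            f.seek(offset)
            chunk = f.read()
        match = pattern.search(chunk.decode("utf-8", errors="replace"))
        if match:
            return match, offset + len(chunk)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no match for {pattern.pattern!r} in {path}")
        time.sleep(interval)
```

A test would record the log size before attempting the connection that is expected to be rejected, pass that size as `offset`, and only proceed once the exit message for that backend has appeared. Unlike a fixed sleep, this waits exactly as long as the postmaster needs to reap the child and release its slot, up to the timeout. It does assume the server logs at a level that emits that DEBUG message.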

Greetings,

Andres Freund
