Re: Refactoring postmaster's code to cleanup after child exit

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactoring postmaster's code to cleanup after child exit
Date: 2024-10-04 22:03:41
Message-ID: CA+hUKGLLtH0ZT3+7i9xJzY8wufmqA7y4q9uC2TsGcBHBuRX4sA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 5, 2024 at 7:41 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> My test for dead-end backends opens 20 TCP (or unix domain) connections
> to the server, in quick succession. That works fine on my system, and it
> passed cirrus CI on other platforms, but on FreeBSD it failed
> repeatedly. The behavior in that scenario is apparently
> platform-dependent: it depends on the accept queue size, but what
> happens when you reach the queue size also seems to depend on the
> platform. On my Linux system, the connect() calls in the client are
> blocked if the server doesn't call accept() fast enough, but
> apparently you get an error on *BSD systems.

Right, we've analysed that difference between AF_UNIX implementations
before[1]. It shows up in the real world, where client sockets (i.e.
libpq's) are usually non-blocking, as EAGAIN on Linux (which is not
valid per POSIX) vs ECONNREFUSED on other OSes. All fail to connect,
but the error message is different.
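
For anyone who wants to see the non-blocking case on their own
platform, here is a minimal Python sketch. It is not taken from the
patch or the test suite; the function name probe_unix_backlog and its
parameters are made up for illustration. It listens on an AF_UNIX
socket, never calls accept(), and records what each non-blocking
connect() reports.

```python
import errno
import os
import socket
import tempfile


def probe_unix_backlog(backlog=1, attempts=20):
    """Connect repeatedly to an AF_UNIX listener that never calls accept().

    Returns one entry per attempt: "ok" if connect() succeeded, otherwise
    the symbolic errno name (for example "EAGAIN" or "ECONNREFUSED").
    """
    results = []
    clients = []
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "s")  # keep the socket path short
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            server.bind(path)
            server.listen(backlog)
            for _ in range(attempts):
                client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
                client.setblocking(False)  # like libpq's client sockets
                clients.append(client)     # hold it open so the queue stays full
                try:
                    client.connect(path)
                    results.append("ok")
                except OSError as exc:
                    name = errno.errorcode.get(exc.errno, str(exc.errno))
                    results.append(name)
        finally:
            for client in clients:
                client.close()
            server.close()
    return results


if __name__ == "__main__":
    print(probe_unix_backlog())
```

Going by the analysis above, I'd expect the failures to be reported as
EAGAIN on Linux and ECONNREFUSED on the BSD-derived systems once the
listen queue is full, but how many connections get in first is up to
the kernel. The sketch deliberately sticks to non-blocking sockets,
since a blocking connect() would simply hang on Linux.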

For blocking AF_UNIX client sockets like in your test, Linux
effectively has an infinite queue made from two layers. The listen
queue (a queue of connecting sockets) does respect the requested
backlog size, but when it's full it has an extra trick: the connect()
call waits (in a queue of threads) for space to become free in the
listen queue, so it's effectively unlimited (but only for blocking
sockets), while FreeBSD, and I suspect any other implementation
deriving from or reimplementing the BSD socket code, gives you
ECONNREFUSED. macOS behaves just the same as FreeBSD AFAICT, so I
don't know why you didn't see the same thing... I guess it's just
racing against accept() draining the queue.

It's possible that Windows copied the Linux behaviour for AF_UNIX,
given that it probably has something to do with the WSL project for
emulating Linux, but IDK.

[1] https://www.postgresql.org/message-id/flat/CADc_NKg2d%2BoZY9mg4DdQdoUcGzN2kOYXBu-3--RW_hEe0tUV%3Dg%40mail.gmail.com

> I'm not sure of the exact details, but in any case, platform-dependent
> behavior needs to be avoided in tests. So I changed the test so that it
> sends an SSLRequest packet on each connection and waits for reply (which
> is always 'N' to reject it in this test), before opening the next
> connection. This way, each connection is still left hanging, which is
> what I want in this test, but only after postmaster has successfully
> accept()ed it and forked the backend.

Makes sense.
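
To spell out the handshake being described, here is a rough Python
sketch of the idea. It is not the actual test code; the helper names
ssl_request_packet and open_hanging_connections are hypothetical. The
SSLRequest message is a 4-byte length (8) followed by the request code
80877103, and the server answers with a single byte, 'S' or 'N'.
Waiting for that byte before opening the next connection means each
one has already been accept()ed by the time the next connect() runs.

```python
import socket
import struct

# SSLRequest code from the PostgreSQL wire protocol: (1234 << 16) | 5679
SSL_REQUEST_CODE = 80877103


def ssl_request_packet():
    """Build the 8-byte SSLRequest message: int32 length, int32 code."""
    return struct.pack("!ii", 8, SSL_REQUEST_CODE)


def open_hanging_connections(host, port, count):
    """Open `count` connections one at a time and leave them hanging.

    Each connection sends SSLRequest and waits for the one-byte reply
    before the next connection is opened, so the listen queue never has
    more than one pending entry. The caller must close the sockets.
    """
    held = []
    try:
        for _ in range(count):
            sock = socket.create_connection((host, port))
            held.append(sock)
            sock.sendall(ssl_request_packet())
            reply = sock.recv(1)
            if reply not in (b"S", b"N"):
                raise RuntimeError(f"unexpected SSLRequest reply: {reply!r}")
    except Exception:
        # Do not leak the sockets opened so far if something goes wrong.
        for sock in held:
            sock.close()
        raise
    return held
```

Something like open_hanging_connections("localhost", 5432, 20) would
then reproduce the 20 idle connections without depending on how the
platform treats a full accept queue; the host and port here are
placeholders.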
