Re: connection establishment versus parallel workers

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: connection establishment versus parallel workers
Date: 2024-12-11 22:36:46
Message-ID: CA+hUKGLOcxUa6m7UinPN1gZXFyr92L8btG_pGTHPiWY2YbRw2w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards. Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers. I've attached a set of reproduction
> steps. The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_processes is set very high
> (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this. I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion. TBH I'm not sure we really need to do anything
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is this behavior, described in a comment in WaitEventSetWait():

* Check if the latch is set already. If so, leave the loop
* immediately, avoid blocking again. We don't attempt to report any
* other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help. Which reminds me a bit of commit 04a09ee9 (Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same
way: if you already saw a latch, you'd use a zero timeout for the
system call.
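
A minimal sketch of that idea, assuming a plain poll()-based wait rather
than the real WaitEventSet code. The names wait_for_events and
latch_is_set are made up for illustration and do not exist in PostgreSQL;
the point is only that an already-set latch turns the wait into a
non-blocking poll instead of an early return, so ready sockets get
reported in the same pass:

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code.
 *
 * Wait for events on the given descriptors.  If the caller's "latch" is
 * already set, we do not skip the system call; we force a zero timeout so
 * the kernel still tells us which descriptors are ready right now.
 *
 * Returns the number of descriptors with non-zero revents (as poll() does),
 * 0 if none are ready, or -1 on error.
 */
static int
wait_for_events(struct pollfd *fds, nfds_t nfds, bool latch_is_set,
                int timeout_ms)
{
    assert(fds != NULL && nfds > 0);

    /* Latch already set: never block, but still collect other events. */
    if (latch_is_set)
        timeout_ms = 0;

    return poll(fds, nfds, timeout_ms);
}
```

With the latch set, a caller would handle the latch and then also walk
fds[] for any entries with POLLIN in revents, rather than going around
the loop again to discover them one wakeup at a time.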
