Re: connection establishment versus parallel workers

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: connection establishment versus parallel workers
Date: 2025-01-13 20:42:00
Message-ID: CA+hUKG+_-34Qo2pPpSwdbZ8GK4c5Gc2WyxXKivowk=XuCD4+sw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 14, 2025 at 8:50 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> On Thu, Dec 19, 2024 at 10:09:35AM -0600, Nathan Bossart wrote:
> > On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
> >> 0001 patch is unchanged, 0002 patch sketches out a response to the
> >> observation a couple of paragraphs above.
> >
> > Both of these patches seem to improve matters quite a bit. I haven't yet
> > thought too deeply about it all, but upon a skim, your patches seem
> > entirely reasonable to me.
>
> I gave these a closer look, and I still feel that they are both
> straightforward and reasonable. IIUC the main open question is whether
> this might cause problems for other PM signal kinds. Like you, I don't see
> anything immediately obvious there, but I'll admit I'm not terribly
> familiar with the precise characteristics of postmaster signals. In any
> case, 0001 feels pretty safe to me.

Cool. Thanks. I'll think about what else could be affected by that
change, as you say, and if nothing jumps out I'll go ahead and commit
them, back-patched to 16.

I have studied this problem a lot more and was about to write in with
some more patches to propose for master only. Basically, that "100" is
destroying performance in this workload, which at least on my machine
hardly gets any parallelism at all, and then only in sporadic bursts.
You can argue that we aren't designed for high-frequency short-lived
workers (we'll have to reuse workers in some way to be good at that),
but I don't think it has to fail as badly as it does today. It falls
off a cliff instead of plateauing: we are so busy forking that we don't
get around to reaping children, so all our slots are (artificially)
used up most of the time, and the queries that do manage to nab one
then sit on their hands for a long time at query end.

"1" gets much smoother results, but as prophesied in aa1351f1, the
complexity is terrible, possibly even O(n^3) in places depending on how
you count: there are many places that scan the whole worker list, and
one that even scans it again for each item, and all of that happens for
each thing that starts. IOW we have to fix the complexity
fundamentally. I have a WIP patch that adds a couple of work queues, so
that the postmaster never has to consider anything more than the head
of a queue in various places. More soon...
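
To make the complexity point concrete, here is a toy model in Python.
It is not the WIP patch and not postmaster code: the slot dictionaries,
the "pending"/"started" states, and all function names are invented for
illustration. It only contrasts rescanning a whole worker list for each
start with popping the head of a queue of pending starts.

```python
from collections import deque


def make_slots(n):
    """Build n hypothetical worker slots, all waiting to be started."""
    return [{"id": i, "state": "pending"} for i in range(n)]


def start_by_scanning(slots):
    """Start pending workers one at a time, rescanning the list each time.

    Returns the number of slot inspections performed, including the
    final full scan that finds nothing left to start.
    """
    inspections = 0
    while True:
        found = False
        for slot in slots:
            inspections += 1
            if slot["state"] == "pending":
                slot["state"] = "started"
                found = True
                break
        if not found:
            return inspections


def start_from_queue(pending):
    """Start pending workers by looking only at the head of a queue.

    Returns the number of slot inspections performed.
    """
    inspections = 0
    while pending:
        slot = pending.popleft()  # only the head is ever examined
        inspections += 1
        slot["state"] = "started"
    return inspections


if __name__ == "__main__":
    for n in (4, 100, 1000):
        scanned = start_by_scanning(make_slots(n))
        queued = start_from_queue(deque(make_slots(n)))
        print(f"n={n}: scan inspections={scanned}, queue inspections={queued}")
```

In this model the scanning version inspects n(n+1)/2 + n slots to start
n workers, while the queue version inspects exactly n. The real
postmaster has more moving parts than this, so treat the numbers only
as an illustration of the shape of the problem.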

> > However, while this makes the test numbers for >= v16 look more like those
> > for v15, we're also seeing a big jump from v13 to v14. This bisects pretty
> > cleanly to commit d872510. I haven't figured out _why_ this commit is
> > impacting this particular test, but I figured I'd at least update the
> > thread with what we know so far.
>
> I regrettably have no updates on this one, yet.

My first thought was that the catalogues needed for connection might
be getting evicted, but surely the data size is too small for that,
and you'd probably have picked it up immediately from wait events.
Weird.
