Re: PMChildFlags array

From: bhargav kamineni <bhargavpostgres(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: PMChildFlags array
Date: 2019-10-05 17:26:27
Message-ID: CADCf-WPf2C-c6UNUNK_U-OrFnx-fhcET9XOC58mn31CkQ_QTjw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thanks Tom Lane for detailing the issue.
>So ... how many is "a bunch"?
more than 85

>Looking at the code, it seems like it'd be possible for a sufficiently
>aggressive spawner of incoming connections to reach the
>MaxLivePostmasterChildren limit. While the postmaster would correctly
>reject additional connection attempts after that, what it would not do
>is ensure that any child slots are left for new parallel worker processes.
>So we could hypothesize that the error you're seeing in the log is from
>failure to spawn a parallel worker process, due to being out of child
>slots.
Thanks Tom Lane for detailing the issue.

we have enabled "max_parallel_workers_per_gather = 4". 20 days before we
ran into this issue .

>However, given that max_connections = 500, MaxLivePostmasterChildren()
>would be 1000-plus. This would mean that reaching this condition would
>require *at least* 500 concurrent connection-attempts-that-haven't-yet-
>been-rejected, maybe well more than that if you didn't have close to
>500 legitimately open sessions. That seems like a lot, enough to suggest
>that you've got some pretty serious bug in your client-side logic.

below errors observed after crash in postgres logfile :

ERROR: xlog flush request is not satisfied for couple of tables , we have
initiated the vacuum full on those tables and the error went off after that.
ERROR: right sibling's left-link doesn't match: block 273660 links to
273500 instead of expected 273661 in index -- observed this error while
doing vacuum freeze on databsase , we have dropped this index and created a
new one

Observations :

Vacuum freeze analyze job is getting stuck at database end which is
initiated thru cronjob, pg_cancel_backend(), pg_termiante_backend() is not
able to terminate those stuck process , Restarting the database only able
to clear those process , i am thinking this is happening due to corruption
(if this is true how can i detect this ? pg_dump ?). is there any way to
overcome this problem ?

does migrating the database to a new instance (pg_basebackup and switching
over to new instance ) solves this issue ?

Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
number of acceptable connections is identical to the number of allowed
child processes; it needs to be less, by the number of background
processes we want to support. But it seems like a darn hard-to-hit bug,
so I'm not quite sure that that explains your observation.

On Fri, 4 Oct 2019 at 03:49, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> bhargav kamineni <bhargavpostgres(at)gmail(dot)com> writes:
> >> What was the database doing just before the FATAL line?
>
> > Postgres was rejecting a bunch of connections from a user who is having a
> > connection limit set. that was the the FATAL error that i could see in
> log
> > file.
> > FATAL,53300,"too many connections for role ""user_app"""
>
> So ... how many is "a bunch"?
>
> Looking at the code, it seems like it'd be possible for a sufficiently
> aggressive spawner of incoming connections to reach the
> MaxLivePostmasterChildren limit. While the postmaster would correctly
> reject additional connection attempts after that, what it would not do
> is ensure that any child slots are left for new parallel worker processes.
> So we could hypothesize that the error you're seeing in the log is from
> failure to spawn a parallel worker process, due to being out of child
> slots.
>
> However, given that max_connections = 500, MaxLivePostmasterChildren()
> would be 1000-plus. This would mean that reaching this condition would
> require *at least* 500 concurrent connection-attempts-that-haven't-yet-
> been-rejected, maybe well more than that if you didn't have close to
> 500 legitimately open sessions. That seems like a lot, enough to suggest
> that you've got some pretty serious bug in your client-side logic.
>
> Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
> number of acceptable connections is identical to the number of allowed
> child processes; it needs to be less, by the number of background
> processes we want to support. But it seems like a darn hard-to-hit bug,
> so I'm not quite sure that that explains your observation.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message PegoraroF10 2019-10-05 17:37:10 Re: Performance on JSONB select
Previous Message Fabrízio de Royes Mello 2019-10-05 15:00:05 Re: Performance on JSONB select