Re: Auto-vacuum timing out and preventing connections

From: David Johansen <davejohansen(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Auto-vacuum timing out and preventing connections
Date: 2022-07-15 04:22:43
Message-ID: CAAcYxUcKYdWBRXUah_QNDTi4BfzsyPBw6KTC054PTWZeYLmNew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Jul 14, 2022 at 9:42 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2022-07-14 10:51:39 -0600, David Johansen wrote:
> > On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen(at)gmail(dot)com>
> > wrote:
> >
> > > On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
> wrote:
> > >
> > >> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <
> davejohansen(at)gmail(dot)com>
> > >> wrote:
> > >>
> > >>> We're running into an issue where the database can't be connected
> to. It
> > >>> appears that the auto-vacuum is timing out and then that prevents new
> > >>> connections from happening. This assumption is based on these logs
> showing
> > >>> up in the logs:
> > >>> WARNING: worker took too long to start; canceled
> > >>> The log appears about every 5 minutes and eventually nothing can
> connect
> > >>> to it and it has to be rebooted.
> > >>>
> > >>
> > >> As Julien suggested, this sounds like another victim, not the cause.
> Is
> > >> there anything else in the log files?
> > >>
> > >
> > > That's the only thing in the logs for the 12-24 hours before the
> database
> > > becomes inaccessible.
> > >
> >
> > To follow up on this, this was the symptom and not the cause. The
> > auto-vacuum was failing to start because of a bug and not the cause of
> the
> > problem.
>
> What bug?
>

It appears to have been related to the scaling and process management that
Aurora Serverless V2 does. I haven't been able to find any info posted
about this issue from AWS, but we opened a support case and were told the
following:

We have identified a critical stability update for Aurora PostgreSQL
Serverless v2 instances running versions 13.6, 13.7, and 14.3. We have also
identified a critical issue in Aurora PostgreSQL Serverless v2 clusters
running versions 13.7 and 14.3. These issues can cause database restarts or
failovers under specific conditions. We have developed fixes and are
deploying the fixes in two patches. The patches will be automatically
applied to the affected instances and clusters in upcoming maintenance
windows over the next 3 weeks causing two restarts of your database. One
patch will show as a security and stability update and one patch will show
as a database update. They will be scheduled sequentially.

The symptoms we observed were slightly different than what is described
above, but we manually applied the patches as soon as they were available
and haven't noticed the problem since.

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Kyotaro Horiguchi 2022-07-15 04:55:32 Re: Excessive number of replication slots for 12->14 logical replication
Previous Message Andres Freund 2022-07-15 03:58:27 Re: [15] Custom WAL resource managers, single user mode, and recovery