Re: Slow shutdowns sometimes on RDS Postgres

From: Chris Williams <cswilliams(at)gmail(dot)com>
To: schnjere(at)amazon(dot)com
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Slow shutdowns sometimes on RDS Postgres
Date: 2018-09-14 23:07:51
Message-ID: CAGj2PiQJAZXnVnsuBj5K-g1AfyzJo+S-krndu-jxRzrdXGJarg@mail.gmail.com
Lists: pgsql-general

Hey Jeremy,

Thanks so much for your response. That's a great tip regarding enabling
enhanced monitoring. Maybe it will give some insight into which particular
backends are causing the shutdown to hang. One interesting data point:
in CloudWatch, once the database started shutting down, the connection
count dropped from 600 to 4 immediately. The graph then showed a constant
4 connections for 10 minutes straight before the database finally shut
down and the count dropped to 0. I'm guessing the shutdown hangs because
of one of those 4 connections. Unfortunately, because the database is
shutting down, I'm not able to connect and get any info about them, so
enabling enhanced monitoring might give some more clues.
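
One thing I'm planning to try next time is snapshotting pg_stat_activity
right before kicking off the modification, so I at least have a record of
what those sessions were doing. Rough sketch (untested; the endpoint and
user below are just placeholders for our setup):

    # Placeholder endpoint/user; dump session state to a file right before
    # starting the instance modification, while I can still connect.
    psql "host=mydb.xxxx.us-east-1.rds.amazonaws.com dbname=postgres user=master" \
      -c "SELECT pid, usename, application_name, state,
                 wait_event_type, wait_event,
                 now() - xact_start AS xact_age,
                 left(query, 60) AS query
          FROM pg_stat_activity
          ORDER BY xact_start NULLS LAST;" > pre_shutdown_activity.txt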

My other question is: do you think manually shutting down my apps once I
noticed the shutdown was hanging had any effect on the total shutdown time?
It seems more than coincidental that the database finally shut down after
exactly 10 minutes. That makes me think some internal timeout in RDS's
shutdown script kicked in, and that shutting down my apps didn't actually
change anything. I'd much rather just wait 10 minutes than frantically try
to shut down all of my apps. So I'd be curious whether you're able to look
up what that timeout is, and whether you think it's what actually forced
the final shutdown. While I'd much rather have the database shut down in a
minute or two, at least there would be some comfort in knowing that 10
minutes is the upper bound.
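
For what it's worth, I'm also planning to pull the exact timing out of
CloudWatch to confirm that 10-minute gap. Something like this, if I have
the CLI flags right (instance name and time window are placeholders):

    # Placeholder instance id and window; one-minute DatabaseConnections
    # samples show exactly when the count dropped to 4 and when it hit 0.
    aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS \
      --metric-name DatabaseConnections \
      --dimensions Name=DBInstanceIdentifier,Value=my-prod-db \
      --start-time 2018-09-14T00:00:00Z \
      --end-time 2018-09-14T01:00:00Z \
      --period 60 \
      --statistics Maximum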

I'd love to figure out how to reproduce it, but unfortunately it doesn't
happen consistently, and I've only ever seen it on our main production
database, which makes things trickier. We do need to resize the database
again in a few weeks. If there are any special debug settings I should
enable before modifying it (besides enhanced monitoring), or if support
wants to observe it in the middle of the shutdown, let me know. The last
two times I've resized the database the shutdown has hung like this, so
there's a good chance it will do it again.
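
On the debug settings front, the main thing I could think of is turning on
connection/disconnection logging in the parameter group before the resize,
so the postgres log shows which sessions were still around when the
shutdown request came in. Something along these lines, if I have the CLI
right (the parameter group name is a placeholder):

    # Placeholder parameter group name; both settings should be dynamic,
    # so ApplyMethod=immediate shouldn't require a reboot.
    aws rds modify-db-parameter-group \
      --db-parameter-group-name my-prod-pg96 \
      --parameters \
        "ParameterName=log_connections,ParameterValue=1,ApplyMethod=immediate" \
        "ParameterName=log_disconnections,ParameterValue=1,ApplyMethod=immediate"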

Another idea I had was to remove all the security group rules on the db
right before applying the modification, on the theory that it might help
postgres terminate the connections more quickly. That said, it's all a
shot in the dark since we don't know the underlying cause.
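
The other option I was considering, instead of yanking the security group,
is just terminating everything myself right before applying the
modification. Rough sketch (untested; endpoint/user are placeholders, and
it assumes the master user is allowed to signal the other sessions):

    # Placeholder endpoint/user; kick out every other client backend just
    # before starting the modification. Probably won't help if the stuck
    # backend is ignoring signals, which sounds like the real problem.
    psql "host=mydb.xxxx.us-east-1.rds.amazonaws.com dbname=postgres user=master" \
      -c "SELECT pid, usename, pg_terminate_backend(pid)
          FROM pg_stat_activity
          WHERE pid <> pg_backend_pid();"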

Best,
Chris

On Fri, Sep 14, 2018 at 8:43 AM Jeremy Schneider <schnjere(at)amazon(dot)com>
wrote:

> Hi Chris - this is an interesting one that we do see from time to time;
> seems worth responding here as actually our best understanding right now
> is that this is something in community code, not AWS-specific.
>
>
> On 9/13/18 15:17, Joshua D. Drake wrote:
> > This is probably something you would have to talk to Amazon about. AWS
> > RDS Postgres is a fork of PostgreSQL and not 100% compatible from an
> > administrative perspective.
>
> Hey JD! FWIW, the code differences are pretty minimal and generally just
> what's required to have a managed service where people can still use the
> database as they normally would. The biggest difference is just getting
> used to operating without direct OS access, and working through
> automation/tooling instead. (And there's always EC2 for the many
> customers who want/need superuser but still don't want to maintain the
> hardware.)
>
>
> On 9/13/18 16:10, Adrian Klaver wrote:
> > The thing is, what you are doing ("(e.g. reboot, changing instance
> > size, etc.)") are instance operations not database operations. That
> > comes under AWS's purview.
>
> Correct, managing reboots and hardware reconfigurations would be the
> responsibility of AWS. However Chris' issue here is just that PostgreSQL
> itself took a long time to shut down. I'm not aware of anything
> RDS-specific with this.
>
>
> > On 09/13/2018 03:04 PM, Chris Williams wrote:
> >> I'm using AWS RDS Postgres (9.6.6) and have run into very slow
> >> shutdowns (10+ minutes) a few times when making database modifications
> >> (e.g. reboot, changing instance size, etc.). Other times, it shuts
> >> down quickly (1 minute or so). I have not been able to figure out why
> >> sometimes it takes a long time to shutdown.
>
> I don't know about this specific incident, but I do know that the RDS
> team has seen cases where a backend gets into a state (like a system
> call) where it's not checking signals and thus doesn't receive or
> process the postmaster's request to quit. We've seen these processes
> delay shutdowns and also block recovery on streaming replicas.
>
>
> >> Once I start seeing these messages, I start manually shutting down all
> >> of our applications that are connected to the db. I'm not sure if
> >> shutting down the apps fixes it or if there's some timeout on the RDS
> >> side, but it seems like once I start doing this, the database finally
> >> shuts down.
>
> FYI, yes there is a timeout on the RDS side. The basic workflow is to
> try to shut down postgres the normal way, and if it hasn't cleanly shut
> down after a period of time, forcefully kill it.
>
>
> >> When it takes this long to shut down, it ends up causing a lot more
> >> downtime than I would like. I've tried asking AWS's support why it
> >> takes so long to shutdown sometimes, but they basically just told me
> >> that's "how it works" and that I should try to shut down all of my
> >> connections ahead of time before making database modifications.
>
> Yeah, that's a really frustrating answer! I can think of a few
> additional things you can do. Obviously the holy grail here is a repro
> (as usual) -- I bet if you have a reliable repro then RDS support will
> be interested in taking a look at that. In fact if we can repro on RDS
> then we should be able to repro on community PostgreSQL just as easily,
> and could stack trace the non-responsive backend to see why it's stuck.
>
> Outside of a repro, you can enable enhanced monitoring on RDS which
> gives you a "top-like" process listing where you can at least see the
> basic state of the stuck backend, much like you would with top/ps on
> community postgres. Once you upgrade to PostgreSQL 10 on RDS you can
> also enable Performance Insights which will show you the SQL, if the
> backend happens to be active. I haven't gone real deep on pgSentinel
> yet, but it provides similar functionality on community PostgreSQL.
>
>
> >> We just have a few ruby on rails applications connected to the
> >> database, and don't really have any long running or heavy queries and
> >> the db is under very light load, so I don't understand why it takes so
> >> long to shutdown. We do have a sizeable number of connections though
> >> (about 600) and there are two replicas connected to it. I also tried
> >> setting idle_in_transaction_session_timeout to 300 seconds to see if
> >> that would help, but it made no difference.
>
> Good data point - the fact that you set the idle timeout so low
> reinforces my suspicion that there was a non-responsive backend involved
> here.
>
>
> >> I was wondering if anyone else had seen this behavior on their RDS
> >> Postgres instances or had any suggestions on how I could shorten the
> >> shutdown time?
>
> So yeah, it's not common... but you're not alone. If you're able to
> narrow it down then please share anything you learn!
>
> Hope this helps,
> Jeremy
>
>
> --
> Jeremy Schneider
> Database Engineer
> Amazon Web Services
>
