Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] Should we fix postmaster to avoid slow shutdown?
Date: 2016-11-22 20:59:12
Message-ID: CA+TgmoYb7mFYthxj9dJAjZbXu0gy6NeFLB8u83Ao26VrKGM6zg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 22, 2016 at 3:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I agree. However, in many cases, the major cost of a fast shutdown is
>> getting the dirty data already in the operating system buffers down to
>> disk, not in writing out shared_buffers itself. The latter is
>> probably a single-digit number of gigabytes, or maybe double-digit.
>> The former might be a lot more, and the write of the pgstat file may
>> back up behind it. I've seen cases where an 8kB buffered write from
>> Postgres takes tens of seconds to complete because the OS buffer cache
>> is already saturated with dirty data, and the stats files could easily
>> be a lot more than that.
>
> I think this is mostly FUD, because we don't fsync the stats files. Maybe
> we should, but we don't today. So even if we have managed to get the
> system into a state where physical writes are heavily backlogged, that's
> not a reason to assume that the stats collector will be unable to do its
> thing promptly. All it has to do is push a relatively small amount of
> data into kernel buffers.

I don't believe that's automatically fast, if we're bumping up against
dirty_ratio. However, suppose you're right. Then what prompted the
original complaint? The OP said "The problem here is that postmaster
took as long as 15 seconds to terminate after it had detected a
crashed backend." It clearly WASN'T an indefinite hang as might have
occurred with the malloc-lock problem for which we implemented the
SIGKILL stuff. So something during shutdown took a long time, but not
forever. There's no convincing evidence I've seen that it has to have
been this particular thing, but I find it plausible, because normal
backends bail out without doing much of anything, and here we have a
process that is trying to continue doing work after having received
SIGQUIT. If not this, then what?
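
For anyone who wants to check the "pushing data into kernel buffers is
always prompt" assumption on their own system, here is a rough sketch in
Python. It only times a buffered write, with and without fsync. The file
name and the 8kB payload are illustrative assumptions, and this is not
how the stats collector actually writes its files. On Linux, a plain
write() can be throttled once the amount of dirty data crosses the
vm.dirty_ratio threshold, so the first number is only interesting when
measured while something else is generating heavy write load.

```python
import os
import tempfile
import time


def time_buffered_write(path, payload, do_fsync=False):
    """Return the seconds spent writing payload to path.

    Without do_fsync, the data only has to reach the kernel's page cache.
    With do_fsync, the call also waits for the data to reach stable storage.
    """
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(payload)
        # os.write may write fewer bytes than requested, so loop until done.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        if do_fsync:
            os.fsync(fd)
    finally:
        os.close(fd)
    return time.monotonic() - start


if __name__ == "__main__":
    payload = b"\0" * 8192  # hypothetical 8kB "stats file"
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "stats.tmp")
        buffered = time_buffered_write(path, payload)
        synced = time_buffered_write(path, payload, do_fsync=True)
        print(f"buffered write: {buffered:.6f}s")
        print(f"write + fsync:  {synced:.6f}s")
```

On an idle machine both numbers should be small. The case in dispute is
the buffered figure under a saturated OS cache, which this sketch can
measure but does not by itself reproduce.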

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
