Re: Tracing down buildfarm "postmaster does not shut down" failures

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tracing down buildfarm "postmaster does not shut down" failures
Date: 2016-02-10 16:06:35
Message-ID: 20160210160635.vm26bxzxqpogxbbs@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2016-02-09 22:27:07 -0500, Tom Lane wrote:
> The idea I was toying with is that previous filesystem activity (making
> the temp install, the server's never-fsync'd writes, etc) has built up a
> bunch of dirty kernel buffers, and at some point the kernel goes nuts
> writing all that data. So the issues we're seeing would come and go
> depending on the timing of that I/O spike. I'm not sure how to prove
> such a theory from here.

It'd be interesting to monitor
$ grep -E '^(Dirty|Writeback):' /proc/meminfo
output. At least on linux. It's terribly easy to get the kernel into a
state where it has so much data needing to be written back that an
immediate checkpoint takes pretty much forever.

If I understand the code correctly, once a buffer has been placed into
'writeback', it'll be more-or-less processed in order. That can e.g. be
because these buffers have been written to more than 30s ago. If there
then are buffers later that also need to be written back (e.g. due to an
fsync()), you'll often wait for the earlier ones.

Andres

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Steele 2016-02-10 16:07:23 Re: Updated backup APIs for non-exclusive backups
Previous Message David Steele 2016-02-10 16:06:01 Re: Updated backup APIs for non-exclusive backups