Re: postgresql latency & bgwriter not doing its job

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 16:23:04
Message-ID: 53FE05E8.2010606@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/27/2014 04:20 PM, Andres Freund wrote:
> On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
>>> I think a somewhat smarter version of the explicit flushes in the
>>> hack^Wpatch I posted nearby is going to more likely to be successful.
>>
>>
>> That path is "dangerous" (as in, may not work as intended) if the
>> filesystem doesn't properly understand range flushes (ehem, like
>> ext3).
>
> The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
> guaranteeing durability. And - afaik - not implemented in a file system
> specific manner. It just initiates writeback for individual pages. It
> doesn't cause barrier, journal flushes or anything to be issued. That's
> still done by the fsync() later.
>
> The big disadvantage is that it's a OS specific solution, but I don't
> think we're going to find anything that isn't in this area.

I've been thinking for a long time that we should interleave the writes
and the fsyncs. That still forces up to 1GB of dirty buffers to disk at
once, causing a spike, but at least not more than that. Also, the
scheduling of a spread checkpoint is currently a bit bogus; we don't
take into account the time needed for the fsync phase.

A long time ago, Itagaki Takahiro wrote a patch sort the buffers and
write them out in order
(http://www.postgresql.org/message-id/flat/20070614153758(dot)6A62(dot)ITAGAKI(dot)TAKAHIRO(at)oss(dot)ntt(dot)co(dot)jp)
The performance impact of that was inconclusive, but one thing that it
allows nicely is to interleave the fsyncs, so that you write all the
buffers for one file, then fsync it, then next file and so on. IIRC the
biggest worry with that patch was that sorting the buffers requires a
fairly large amount of memory, and making a large allocation in the
checkpointer might cause an out-of-memory, which would be bad.

I don't think anyone's seriously worked on this area since. If the
impact on responsiveness or performance is significant, I'm pretty sure
the OOM problem could be alleviated somehow.

For the kicks, I wrote a quick & dirty patch for interleaving the
fsyncs, see attached. It works by repeatedly scanning the buffer pool,
writing buffers belonging to a single relation segment at a time. I
would be interested to hear how this performs in your test case.

- Heikki

Attachment Content-Type Size
interleave-fsyncs-1.patch text/x-diff 9.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2014-08-27 16:41:15 Re: postgresql latency & bgwriter not doing its job
Previous Message Alvaro Herrera 2014-08-27 16:18:46 Re: SKIP LOCKED DATA (work in progress)