From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: postgresql latency & bgwriter not doing its job |
Date: | 2014-08-27 16:23:04 |
Message-ID: | 53FE05E8.2010606@vmware.com |
Lists: | pgsql-hackers |
On 08/27/2014 04:20 PM, Andres Freund wrote:
> On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
>>> I think a somewhat smarter version of the explicit flushes in the
>>> hack^Wpatch I posted nearby is going to more likely to be successful.
>>
>>
>> That path is "dangerous" (as in, may not work as intended) if the
>> filesystem doesn't properly understand range flushes (ehem, like
>> ext3).
>
> The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
> guaranteeing durability. And - afaik - not implemented in a file system
> specific manner. It just initiates writeback for individual pages. It
> doesn't cause barrier, journal flushes or anything to be issued. That's
> still done by the fsync() later.
>
> The big disadvantage is that it's a OS specific solution, but I don't
> think we're going to find anything that isn't in this area.
I've been thinking for a long time that we should interleave the writes
and the fsyncs. That still forces up to 1GB of dirty buffers to disk at
once, causing a spike, but at least not more than that. Also, the
scheduling of a spread checkpoint is currently a bit bogus; we don't
take into account the time needed for the fsync phase.
A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and
write them out in order
(http://www.postgresql.org/message-id/flat/20070614153758(dot)6A62(dot)ITAGAKI(dot)TAKAHIRO(at)oss(dot)ntt(dot)co(dot)jp)
The performance impact of that was inconclusive, but one thing that it
allows nicely is to interleave the fsyncs, so that you write all the
buffers for one file, then fsync it, then next file and so on. IIRC the
biggest worry with that patch was that sorting the buffers requires a
fairly large amount of memory, and making a large allocation in the
checkpointer might cause an out-of-memory, which would be bad.
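To make the write-then-fsync ordering concrete, here is a minimal Python sketch. It is not Itagaki's patch or PostgreSQL code. The function name, the tuple layout standing in for dirty buffers, and the use of plain files are all assumptions made for illustration.

```python
import os
from itertools import groupby
from operator import itemgetter

BLOCK_SIZE = 8192  # PostgreSQL's default page size


def write_sorted_and_fsync(dirty_buffers, directory):
    """Write dirty buffers file by file, fsyncing each file once it is done.

    dirty_buffers: iterable of (file_name, block_number, page_bytes) tuples,
    a hypothetical stand-in for the dirty pages in the buffer pool.
    Returns the list of operations performed, in order.
    """
    ops = []
    # Sorting materializes one entry per dirty buffer; this is the
    # allocation that the out-of-memory worry is about.
    ordered = sorted(dirty_buffers, key=itemgetter(0, 1))
    for file_name, group in groupby(ordered, key=itemgetter(0)):
        path = os.path.join(directory, file_name)
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            for _, block, page in group:
                os.pwrite(fd, page, block * BLOCK_SIZE)
                ops.append(("write", file_name, block))
            # Flush this file before touching the next one, so at most
            # one file's worth of dirty data is forced out at a time.
            os.fsync(fd)
            ops.append(("fsync", file_name))
        finally:
            os.close(fd)
    return ops
```

The returned list makes the interleaving visible: all writes for one file, then its fsync, then the next file.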
I don't think anyone's seriously worked on this area since. If the
impact on responsiveness or performance is significant, I'm pretty sure
the OOM problem could be alleviated somehow.
For kicks, I wrote a quick & dirty patch for interleaving the
fsyncs, see attached. It works by repeatedly scanning the buffer pool,
writing buffers belonging to a single relation segment at a time. I
would be interested to hear how this performs in your test case.
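The repeated-scan idea can be sketched roughly as follows. This is not the code from the attached patch, only an illustration in Python. The dict layout for buffers and the function name are hypothetical, and plain files stand in for relation segments.

```python
import os

BLOCK_SIZE = 8192  # PostgreSQL's default page size


def checkpoint_by_repeated_scan(buffer_pool, directory):
    """Flush dirty buffers one segment at a time by rescanning the pool.

    buffer_pool: list of dicts with keys 'segment', 'block', 'page' and
    'dirty' (a hypothetical stand-in for shared buffers).
    Returns the list of operations performed, in order.
    """
    ops = []
    while True:
        # Pick the segment of the first buffer that is still dirty.
        target = next((b["segment"] for b in buffer_pool if b["dirty"]), None)
        if target is None:
            break  # nothing left to write
        path = os.path.join(directory, target)
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            # One full pass over the pool, writing only this segment's buffers.
            for buf in buffer_pool:
                if buf["dirty"] and buf["segment"] == target:
                    os.pwrite(fd, buf["page"], buf["block"] * BLOCK_SIZE)
                    buf["dirty"] = False
                    ops.append(("write", target, buf["block"]))
            os.fsync(fd)  # fsync the segment before moving on
            ops.append(("fsync", target))
        finally:
            os.close(fd)
    return ops
```

Unlike the sorting approach, this needs no extra allocation, at the cost of one pass over the buffer pool per dirty segment, and blocks within a segment are written in pool order rather than block order.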
- Heikki
Attachment | Content-Type | Size |
---|---|---|
interleave-fsyncs-1.patch | text/x-diff | 9.7 KB |