From: | Jim Nasby <jim(at)nasby(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu> |
Cc: | Claudio Freire <klaussfreire(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Why we are going to have to go DirectIO |
Date: | 2013-12-08 21:13:25 |
Message-ID: | 52A4E0F5.1090008@nasby.net |
Lists: | pgsql-hackers |
On 12/5/13 9:59 AM, Tom Lane wrote:
> Greg Stark <stark(at)mit(dot)edu> writes:
>> I think the way to use mmap would be to mmap very large chunks,
>> possibly whole tables. We would need some way to control page flushes
>> that doesn't involve splitting mappings and can be efficiently
>> controlled without having the kernel storing arbitrarily large tags on
>> page tables or searching through all the page tables to mark pages
>> flushable.
>
> I might be missing something, but AFAICS mmap's API is just fundamentally
> wrong for this. The kernel is allowed to write-back a modified mmap'd
> page to the underlying file at any time, and will do so if say it's under
> memory pressure. You can tell the kernel to sync now, but you can't tell
> it *not* to sync. I suppose you are thinking that some wart could be
> grafted onto that API to reverse that, but I wouldn't have a lot of
> confidence in it. Any VM bug that caused the kernel to sometimes write
> too soon would result in nigh unfindable data consistency hazards.
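To make Tom's point concrete, here is a minimal sketch of the asymmetry, using Python's mmap module as a stand-in for the C calls (on Unix, flush() ends up in msync()). The file, the helper names and the one-page size are made up for illustration; none of this is PostgreSQL code.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def dirty_and_sync(path):
    """Modify one page through a shared mapping, then force it to disk."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), PAGE) as m:
            m[0:5] = b"dirty"
            # From this point the kernel may write the page back whenever it
            # likes (for example under memory pressure). There is no call
            # here that says "hold this page until I say so".
            m.flush()  # "sync now": this direction is the one the API offers
    with open(path, "rb") as f:
        return f.read(5)


def demo():
    """Create a one-page scratch file, run the sketch, clean up."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * PAGE)
        os.close(fd)
        return dirty_and_sync(path)
    finally:
        os.unlink(path)
```

The only knob is "write it now". Nothing in the mapping API lets the caller hold a dirty page back until its WAL record is safely on disk, which is the ordering we would need.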
Something else to ponder... a Seagate researcher gave a talk on upcoming hard drive technology at RICON East this spring. The interesting bit is that 1 or 2 generations down the road HDs will start using "shingling": the write head has to be bigger than the read head, so tracks will be laid down overlapping each other, and you will not be able to modify a range of tracks in place after they've been written. They'll handle updates by keeping a journal inside the HD. This is somewhat similar to how SSDs work too (you can only erase large pages of data; you can't update individual bytes/sectors/filesystem blocks).
So long-term, random-access updates to permanent storage will be less efficient than today. (Of course, non-volatile memory could turn all this on its head.)
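For a rough feel of why in-place updates get expensive, here is a toy model. It is my own simplification, not anything from the talk: assume tracks are grouped into bands, each track partly overlaps the next one in its band, and so rewriting track i in place forces a rewrite of every later track in that band. The band size and function names are hypothetical.

```python
def tracks_rewritten(band_size, track_index):
    """Tracks that must be written to update one track in place.

    Toy assumption: writing track i clobbers track i+1, which then has to be
    rewritten too, and so on to the end of the band.
    """
    if not 0 <= track_index < band_size:
        raise ValueError("track_index must be inside the band")
    return band_size - track_index


def average_amplification(band_size):
    """Mean tracks written per single-track update, uniform over the band."""
    total = sum(tracks_rewritten(band_size, i) for i in range(band_size))
    return total / band_size
```

Under those assumptions a 4-track band averages 2.5 track writes per one-track update, and the cost grows with band size, which is presumably why the drive journals updates internally instead of doing them in place.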
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net