From: | Jim Nasby <jim(at)nasby(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Trond Myklebust <trondmy(at)gmail(dot)com> |
Cc: | Bottomley James <James(dot)Bottomley(at)HansenPartnership(dot)com>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Andres Freund <andres(at)2ndQuadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dave Chinner <david(at)fromorbit(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net> |
Subject: | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
Date: | 2014-01-15 04:01:39 |
Message-ID: | 52D60823.9040202@nasby.net |
Lists: | pgsql-hackers |
On 1/14/14, 10:08 AM, Tom Lane wrote:
> Trond Myklebust <trondmy(at)gmail(dot)com> writes:
>> On Jan 14, 2014, at 10:39, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> "Don't be aggressive" isn't good enough. The prohibition on early write
>>> has to be absolute, because writing a dirty page before we've done
>>> whatever else we need to do results in a corrupt database. It has to
>>> be treated like a write barrier.
>
>> Then why are you dirtying the page at all? It makes no sense to tell the kernel “we’re changing this page in the page cache, but we don’t want you to change it on disk”: that’s not consistent with the function of a page cache.
>
> As things currently stand, we dirty the page in our internal buffers,
> and we don't write it to the kernel until we've written and fsync'd the
> WAL data that needs to get to disk first. The discussion here is about
> whether we could somehow avoid double-buffering between our internal
> buffers and the kernel page cache.
>
> I personally think there is no chance of using mmap for that; the
> semantics of mmap are pretty much dictated by POSIX and they don't work
> for this. However, disregarding the fact that the two communities
> speaking here don't control the POSIX spec, you could maybe imagine
> making it work if *both* pending WAL file contents and data file
> contents were mmap'd, and there were kernel APIs allowing us to say
> "you can write this mmap'd page if you want, but not till you've written
> that mmap'd data over there". That'd provide the necessary
> write-barrier semantics, and avoid the cache coherency question because
> all the data visible to the kernel could be thought of as the "current"
> filesystem contents, it just might not all have reached disk yet; which
> is the behavior of the kernel disk cache already.
>
> I'm dubious that this sketch is implementable with adequate efficiency,
> though, because in a live system the kernel would be forced to deal with
> a whole lot of active barrier restrictions. Within Postgres we can
> reduce write-ordering tests to a very simple comparison: don't write
> this page until WAL is flushed to disk at least as far as WAL sequence
> number XYZ. I think any kernel API would have to be a great deal more
> general and thus harder to optimize.
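A minimal sketch of the comparison described above, assuming a WAL position can be modelled as a plain 64-bit offset. All type and function names here are hypothetical and are not taken from the Postgres source:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position: a plain 64-bit byte offset. */
typedef uint64_t wal_pos_t;

/* Hypothetical buffer header holding only the fields this sketch needs. */
typedef struct {
    bool      dirty;     /* page was changed in our own buffer pool */
    wal_pos_t page_lsn;  /* WAL position of the last record that touched it */
} page_buf_t;

/* Record a change: mark the page dirty and remember the WAL record position. */
static void page_mark_changed(page_buf_t *page, wal_pos_t record_lsn)
{
    assert(record_lsn >= page->page_lsn);  /* WAL positions only move forward */
    page->dirty = true;
    page->page_lsn = record_lsn;
}

/* The whole ordering test: may this page be handed to the kernel yet? */
static bool page_may_be_written(const page_buf_t *page, wal_pos_t wal_flushed_upto)
{
    return page->page_lsn <= wal_flushed_upto;
}
```

The point of the sketch is that the ordering constraint collapses to a single integer comparison per page, whereas a general-purpose kernel barrier API would have to track arbitrary "write A before B" dependencies.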
For the sake of completeness... it's theoretically silly that Postgres is doing all this stuff with WAL when the filesystem is doing something very similar with its journal. And an SSD (and next-generation spinning rust) is doing the same thing *again* in its own journal.
If all 3 communities (or even just 2 of them!) could agree on the necessary interface, a tremendous amount of this duplicated technology could be eliminated.
That said, I rather doubt the Postgres community would go this route, not so much because of the presumably massive changes needed, but more because our community is not a fan of restricting our users to things like "Thou shalt use a journaled FS or risk all thy data!"
> Another difficulty with merging our internal buffers with the kernel
> cache is that when we're in the process of applying a change to a page,
> there are intermediate states of the page data that should under no
> circumstances reach disk (eg, we might need to shuffle records around
> within the page). We can deal with that fairly easily right now by not
> issuing a write() while a page change is in progress. I don't see that
> it's even theoretically possible in an mmap'd world; there are no atomic
> updates to an mmap'd page that are larger than whatever is an atomic
> update for the CPU.
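A rough sketch of why this is easy with private buffers, assuming a hypothetical per-page busy flag rather than whatever locking Postgres actually uses. The flusher only ever copies a page when no change is in progress, so half-shuffled contents never reach write():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8192  /* assumed page size for this sketch */

/* Hypothetical page in a private buffer pool. */
typedef struct {
    atomic_flag   busy;  /* set while a change or a flush copy is running */
    unsigned char data[SKETCH_PAGE_SIZE];
} page_t;

static void page_init(page_t *p)
{
    atomic_flag_clear(&p->busy);
    memset(p->data, 0, sizeof p->data);
}

/* Returns true if the caller now owns the page and may modify it freely. */
static bool page_begin_change(page_t *p)
{
    return !atomic_flag_test_and_set(&p->busy);
}

static void page_end_change(page_t *p)
{
    atomic_flag_clear(&p->busy);
}

/*
 * Copy the page for a later write(), but only if no change is in progress.
 * Returns false when the page is mid-change; the caller retries later.
 */
static bool page_snapshot_for_write(page_t *p, unsigned char *out)
{
    assert(out != NULL);
    if (atomic_flag_test_and_set(&p->busy))
        return false;  /* intermediate state: must not reach disk */
    memcpy(out, p->data, SKETCH_PAGE_SIZE);
    atomic_flag_clear(&p->busy);
    return true;
}
```

With an mmap'd page there is no equivalent of the refused snapshot: the kernel is free to write the page back at any moment, including between two stores of a multi-step change.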
Yet another problem with trying to combine database and journaled FS efforts... :(
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net