From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Kevin Brown <kevin(at)sysexperts(dot)com> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: First set of OSDL Shared Mem scalability results, some wierdness ... |
Date: | 2004-10-09 23:05:37 |
Message-ID: | 4859.1097363137@sss.pgh.pa.us |
Lists: | pgsql-hackers pgsql-performance |
Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> Tom Lane wrote:
>> mmap() is Right Out because it does not afford us sufficient control
>> over when changes to the in-memory data will propagate to disk.
> ... that's especially true if we simply cannot
> have the page written to disk in a partially-modified state (something
> I can easily see being an issue for the WAL -- would the same hold
> true of the index/data files?).
You're almost there. Remember the fundamental WAL rule: log entries
must hit disk before the data changes they describe. That means that we
need not only a way of forcing changes to disk (fsync) but a way of
being sure that changes have *not* gone to disk yet. In the existing
implementation we get that by just not issuing write() for a given page
until we know that the relevant WAL log entries are fsync'd down to
disk. (BTW, this is what the LSN field on every page is for: it tells
the buffer manager the latest WAL offset that has to be flushed before
it can safely write the page.)
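
As an illustration of the rule above, here is a minimal toy sketch in Python. It is not PostgreSQL code: the names ToyWAL, ToyBufferManager, modify and write_page are hypothetical, and "disk" is just a dictionary. It only shows the ordering: a page carries the LSN of the last WAL record that touched it, and the buffer manager flushes WAL up to that LSN before it lets the page out.

```python
class ToyWAL:
    """In-memory stand-in for a write-ahead log (hypothetical, not PostgreSQL code)."""

    def __init__(self):
        self.records = []     # appended records, not necessarily durable yet
        self.flushed_lsn = 0  # highest LSN assumed to be safely on disk
        self.flush_count = 0  # number of simulated fsync calls

    def append(self, record):
        self.records.append(record)
        return len(self.records)  # LSN = 1-based position of the record

    def flush(self, upto_lsn):
        # Simulated fsync: only does work if something new must become durable.
        if upto_lsn > self.flushed_lsn:
            self.flushed_lsn = upto_lsn
            self.flush_count += 1


class ToyBufferManager:
    """Keeps modified pages in private buffers until it is safe to write them."""

    def __init__(self, wal):
        self.wal = wal
        self.buffers = {}  # page_id -> (data, page_lsn)
        self.disk = {}     # page_id -> data, stands in for the data file

    def modify(self, page_id, data):
        # Log first, then change the in-memory page and stamp it with the LSN.
        lsn = self.wal.append((page_id, data))
        self.buffers[page_id] = (data, lsn)
        return lsn

    def write_page(self, page_id):
        data, page_lsn = self.buffers[page_id]
        # The WAL rule: log entries must be durable before the page is written.
        self.wal.flush(page_lsn)
        assert self.wal.flushed_lsn >= page_lsn
        self.disk[page_id] = data
```

Under these assumptions, several modify() calls on the same page cost no WAL flush at all until write_page() is finally called, and then a single flush covers all of them.
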
mmap provides msync which is comparable to fsync, but AFAICS it
provides no way to prevent an in-memory change from reaching disk too
soon. This would mean that WAL entries would have to be written *and
flushed* before we could make the data change at all, which would
convert multiple updates of a single page into a series of
write-and-wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per
transaction is bad enough; once per atomic action is intolerable.
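
To put rough numbers on that, here is a small counting sketch in Python. It is a simplified model, not a measurement: both function names are hypothetical, and it assumes one WAL record per update of a single page with nothing else forcing a flush in between.

```python
def wal_flushes_deferred_write(num_updates):
    """Page lives in a private buffer; write() is issued once, at the end."""
    flushed_lsn = 0
    flushes = 0
    page_lsn = 0
    for lsn in range(1, num_updates + 1):
        page_lsn = lsn  # change stays in memory, no flush needed yet
    # Only when the page is about to be written must WAL be durable up to its LSN.
    if page_lsn > flushed_lsn:
        flushed_lsn = page_lsn
        flushes += 1
    return flushes


def wal_flushes_mmap_style(num_updates):
    """Page is mmap'd; the kernel may write it out at any moment."""
    flushed_lsn = 0
    flushes = 0
    for lsn in range(1, num_updates + 1):
        # The change could reach disk as soon as it is made, so the WAL record
        # has to be flushed before the page is touched at all.
        if lsn > flushed_lsn:
            flushed_lsn = lsn
            flushes += 1
    return flushes
```

In this model ten updates to one page cost one WAL flush with deferred write() and ten flushes in the mmap-style scheme.
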
There is another reason for doing things this way. Consider a backend
that goes haywire and scribbles all over shared memory before crashing.
When the postmaster sees the abnormal child termination, it forcibly
kills the other active backends and discards shared memory altogether.
This gives us fairly good odds that the crash did not affect any data on
disk. It's not perfect of course, since another backend might have been
in the process of issuing a write() when the disaster happens, but it's
pretty good; and I think that that isolation has a lot to do with PG's
good reputation for not corrupting data in crashes. If we had a large
fraction of the address space mmap'd then this sort of crash would be
just about guaranteed to propagate corruption into the on-disk files.
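
The difference can be sketched with Python's standard mmap module. This is only an illustration under stated assumptions: the helper names are hypothetical, the "scribble" is a deliberate four-byte overwrite standing in for a haywire backend, and the mmap result relies on the mapping being shared and on a unified page cache, as on Linux.

```python
import mmap
import os
import tempfile

PAGE = b"good" * 1024  # one 4096-byte "page" of known contents


def scribble_private_buffer(path):
    """Corrupt a private in-memory copy, then discard it without write()."""
    with open(path, "rb") as f:
        buf = bytearray(f.read())
    buf[:4] = b"JUNK"
    del buf  # comparable to throwing shared memory away after a crash


def scribble_mmap(path):
    """Corrupt a shared, writable mapping of the file; never call flush()."""
    with open(path, "r+b") as f:
        m = mmap.mmap(f.fileno(), 0)
        m[:4] = b"JUNK"
        m.close()


def read_file(path):
    with open(path, "rb") as f:
        return f.read()


def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, PAGE)
        os.close(fd)
        scribble_private_buffer(path)
        after_private = read_file(path)[:4]
        scribble_mmap(path)
        after_mmap = read_file(path)[:4]
        return after_private, after_mmap
    finally:
        os.remove(path)
```

Calling demo() returns the first four bytes of the file after each kind of scribble. The private copy is gone without a trace in the file, while the mapped write needed no write() or msync() call to become visible there.
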
regards, tom lane