Re: First set of OSDL Shared Mem scalability results, some wierdness ...

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: First set of OSDL Shared Mem scalability results, some wierdness ...
Date: 2004-10-09 23:05:37
Message-ID: 4859.1097363137@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> Tom Lane wrote:
>> mmap() is Right Out because it does not afford us sufficient control
>> over when changes to the in-memory data will propagate to disk.

> ... that's especially true if we simply cannot
> have the page written to disk in a partially-modified state (something
> I can easily see being an issue for the WAL -- would the same hold
> true of the index/data files?).

You're almost there. Remember the fundamental WAL rule: log entries
must hit disk before the data changes they describe. That means that we
need not only a way of forcing changes to disk (fsync) but a way of
being sure that changes have *not* gone to disk yet. In the existing
implementation we get that by just not issuing write() for a given page
until we know that the relevant WAL log entries are fsync'd down to
disk. (BTW, this is what the LSN field on every page is for: it tells
the buffer manager the latest WAL offset that has to be flushed before
it can safely write the page.)
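
To make that ordering concrete, here's a minimal sketch (illustrative C
only, not actual PostgreSQL source; every name in it -- WalOffset,
PageHeader, flush_wal_up_to, write_dirty_page -- is invented for the
example) of the rule the buffer manager enforces:

#include <stdio.h>

typedef unsigned long long WalOffset;

typedef struct PageHeader
{
    WalOffset   lsn;    /* end of the last WAL record that touched this page */
    /* ... page contents ... */
} PageHeader;

static WalOffset wal_flushed_upto = 0;  /* how far the WAL is known to be on disk */

/* Write and fsync the WAL through 'upto', if that hasn't happened already. */
static void
flush_wal_up_to(WalOffset upto)
{
    if (upto > wal_flushed_upto)
    {
        /* ... write() pending WAL buffers, then fsync() the WAL file ... */
        wal_flushed_upto = upto;
        printf("WAL flushed through %llu\n", upto);
    }
}

/* The buffer manager's rule when it must write out a dirty page. */
static void
write_dirty_page(const PageHeader *page)
{
    flush_wal_up_to(page->lsn);     /* log entries hit disk first ... */
    /* ... only now is it safe to write() the page to its data file ... */
    printf("data page with LSN %llu written\n", page->lsn);
}

int
main(void)
{
    PageHeader  page = { 4242 };    /* pretend the last change was logged at WAL offset 4242 */

    write_dirty_page(&page);        /* forces the WAL flush before the data write */
    return 0;
}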

mmap provides msync, which is comparable to fsync, but AFAICS it
provides no way to prevent an in-memory change from reaching disk too
soon. This would mean that WAL entries would have to be written *and
flushed* before we could make the data change at all, which would
convert multiple updates of a single page into a series of write-and-
wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per transaction
is bad enough; once per atomic action would be intolerable.
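
For contrast, here's a rough sketch of the ordering mmap would force on
us (again illustrative only: the helper names are made up, and msync()
is the only real call in it):

#include <stddef.h>
#include <sys/mman.h>

static unsigned long long next_lsn = 0;         /* pretend WAL insert position */
static unsigned long long wal_flushed_upto = 0; /* how far the WAL has been fsync'd */

/* Stub: append a WAL record describing a page change, return its end offset. */
static unsigned long long
append_wal_record(void)
{
    return ++next_lsn;
}

/* Stub: write() and fsync() the WAL through 'upto'. */
static void
fsync_wal_up_to(unsigned long long upto)
{
    if (upto > wal_flushed_upto)
        wal_flushed_upto = upto;
}

/* write()-based buffering: the kernel never sees the modified page until we
 * choose to write() it, so the WAL flush can be deferred and batched. */
void
update_buffered_page(char *private_copy, unsigned long long *page_lsn)
{
    *page_lsn = append_wal_record();    /* log the change, remember its offset in the page LSN */
    private_copy[0] = 'x';              /* change our private copy; invisible to the kernel */
    /* fsync_wal_up_to(*page_lsn) happens later, just before the eventual write() */
}

/* mmap-based buffering: the store below is immediately eligible for writeback,
 * so the WAL record describing it must already be on disk -- one WAL flush per
 * page modification rather than per transaction or checkpoint. */
void
update_mmapped_page(char *mapped_page, size_t pagesize)
{
    unsigned long long lsn = append_wal_record();

    fsync_wal_up_to(lsn);   /* must flush WAL *before* touching the page */
    mapped_page[0] = 'x';   /* the kernel may write this back at any moment */
    (void) msync(mapped_page, pagesize, MS_ASYNC);  /* msync can force the page out, but nothing holds it back */
}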

There is another reason for doing things this way. Consider a backend
that goes haywire and scribbles all over shared memory before crashing.
When the postmaster sees the abnormal child termination, it forcibly
kills the other active backends and discards shared memory altogether.
This gives us fairly good odds that the crash did not affect any data on
disk. It's not perfect, of course, since another backend might have been
in the process of issuing a write() when the disaster happened, but it's
pretty good; and I think that that isolation has a lot to do with PG's
good reputation for not corrupting data in crashes. If we had a large
fraction of the address space mmap'd then this sort of crash would be
just about guaranteed to propagate corruption into the on-disk files.

regards, tom lane
