From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Greg Smith <greg(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Spread checkpoint sync |
Date: | 2010-11-16 02:15:32 |
Message-ID: | AANLkTinDOt3RDzJk6_vX8pEF9d6=xkOt2eT_sYAfM+DA@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> The second issue is that the delay between sync calls is currently
> hard-coded, at 3 seconds. I believe the right path here is to consider the
> current checkpoint_completion_target to still be valid, then work back from
> there. That raises the question of what percentage of the time writes
> should now be compressed into relative to that, to leave some time to spread
> the sync calls. If we're willing to say "writes finish in first 1/2 of
> target, syncs execute in second 1/2", I could implement that here.
> Maybe that ratio needs to be another tunable. Still thinking about that
> part, and it's certainly open to community debate. The thing to realize
> that complicates the design is that the actual sync execution may take a
> considerable period of time. It's much more likely for that to happen than
> in the case of an individual write, as the current spread checkpoint does,
> because those are usually cached. In the spread sync case, it's easy for
> one slow sync to make the rest turn into ones that fire in quick succession,
> to make up for lost time.
I think the behavior of file systems and operating systems is highly
relevant here. We seem to have a theory that allowing a delay between
the write and the fsync should give the OS a chance to start writing
the data out, but do we have any evidence indicating whether and under
what circumstances that actually occurs? For example, if we knew that
it's important to wait at least 30 s but waiting 60 s is no better,
that would be useful information.
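
One way to gather that evidence is a small timing experiment. The following is only a sketch under stated assumptions, not a benchmark anyone in this thread has run: the helper names (`time_fsync_after_delay`, `sweep`), the default write size, and the `directory` parameter are all hypothetical. It writes a scratch file, waits, and times the fsync; if the OS writes data back on its own during the wait, the measured time should shrink as the delay grows.

```python
import os
import tempfile
import time


def time_fsync_after_delay(delay_s, size_bytes=8 * 1024 * 1024, directory=None):
    """Write size_bytes to a scratch file, wait delay_s, return fsync() seconds.

    directory should live on the filesystem under test (hypothetical
    parameter; None falls back to the system temp directory).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * size_bytes)
            f.flush()                  # push the data into the OS cache
            time.sleep(delay_s)        # give the kernel time to write back
            start = time.perf_counter()
            os.fsync(f.fileno())       # force whatever is still dirty to disk
            return time.perf_counter() - start
    finally:
        os.unlink(path)


def sweep(delays, size_bytes=8 * 1024 * 1024, directory=None):
    """Return {delay: fsync seconds} for each delay in delays."""
    return {d: time_fsync_after_delay(d, size_bytes, directory) for d in delays}
```

Calling `sweep([0, 5, 30, 60])` against a directory on the data volume would show whether 60 s buys anything over 30 s on that system. The numbers only mean something when the rest of the machine is generating a realistic write load.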
Another question I have is about how we're actually going to know when
any given fsync can be performed. For any given segment, there are a
certain number of pages A that are already dirty at the start of the
checkpoint. Then there are a certain number of additional pages B
that are going to be written out during the checkpoint. If it so
happens that B = 0, we can call fsync() at the beginning of the
checkpoint without losing anything (in fact, we gain something: any
pages dirtied by cleaning scans or backend writes during the
checkpoint won't need to hit the disk; and if the filesystem dumps
more of its cache than necessary on fsync, we may as well take that
hit before dirtying a bunch more stuff). But if B > 0, then we shouldn't
attempt the fsync() until we've written them all; otherwise we'll end
up having to fsync() that segment twice.
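
The rule above can be sketched as a per-segment counter. This is an illustration only: `SegmentState`, `can_fsync_now`, and `note_page_written` are hypothetical names, not anything in the PostgreSQL source, and the sketch assumes we can count B for each segment up front.

```python
from dataclasses import dataclass


@dataclass
class SegmentState:
    name: str
    dirty_at_start: int             # A: pages already dirty at checkpoint start
    pending_checkpoint_writes: int  # B: pages the checkpoint still has to write


def can_fsync_now(seg):
    """A segment is safe to fsync once the checkpoint owes it no more writes.

    A does not enter the decision; only the remaining B matters.
    """
    return seg.pending_checkpoint_writes == 0


def note_page_written(seg):
    """Record one checkpoint write; return True if the segment is now syncable."""
    if seg.pending_checkpoint_writes <= 0:
        raise ValueError("no checkpoint writes were pending for " + seg.name)
    seg.pending_checkpoint_writes -= 1
    return can_fsync_now(seg)
```

A segment created with `pending_checkpoint_writes=0` is syncable at the start of the checkpoint; any other segment becomes syncable exactly when its last checkpoint write is recorded, so it is never fsynced twice.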
Doing all the writes and then all the fsyncs meets this requirement
trivially, but I'm not so sure that's a good idea. For example, given
files F1 ... Fn with dirty pages needing checkpoint writes, we could
do the following: first, do any pending fsyncs for files not among F1
... Fn; then, write all pages for F1 and fsync, write all pages for F2
and fsync, write all pages for F3 and fsync, etc. This might seem
dumb because we're not really giving the OS a chance to write anything
out before we fsync, but think about the ext3 case where the whole
filesystem cache gets flushed anyway. It's much better to dump the
cache at the beginning of the checkpoint and then again after every
file than it is to spew many GB of dirty stuff into the cache and then
drop the hammer.
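
To make the ordering concrete, here is a sketch that only builds the sequence of operations and performs no I/O. `interleaved_plan` and its argument shapes are hypothetical, chosen for illustration; a real implementation would work from the checkpointer's own pending-fsync table and buffer list.

```python
def interleaved_plan(pending_fsync, checkpoint_pages):
    """Return the ordered operations for the write-then-fsync-per-file scheme.

    pending_fsync: file names with an outstanding fsync request.
    checkpoint_pages: dict mapping file name -> list of page numbers the
        checkpoint must write (the files F1 ... Fn).
    """
    # Files that actually need checkpoint writes, in the given order.
    to_write = {f: pages for f, pages in checkpoint_pages.items() if pages}

    plan = []
    # Step 1: sync everything the checkpoint will not write to again.
    for f in sorted(pending_fsync):
        if f not in to_write:
            plan.append(("fsync", f))
    # Step 2: for each remaining file, write all of its pages, then sync it once.
    for f, pages in to_write.items():
        for page in pages:
            plan.append(("write", f, page))
        plan.append(("fsync", f))
    return plan
```

Each file is fsynced exactly once, and at no point does more than one file's worth of checkpoint writes sit unsynced in the cache.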
I'm just brainstorming here; feel free to tell me I'm all wet.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company