From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spread checkpoint sync
Date: 2011-02-07 15:22:15
Message-ID: AANLkTimvJP5SnS6u826BUt2=NmwM+43AQQgnnKwY_yWg@mail.gmail.com
Lists: pgsql-hackers
2011/2/7 Greg Smith <greg(at)2ndquadrant(dot)com>:
> Robert Haas wrote:
>>
>> With the fsync queue compaction patch applied, I think most of this is
>> now not needed. Attached please find an attempt to isolate the
>> portion that looks like it might still be useful. The basic idea of
>> what remains here is to make the background writer still do its normal
>> stuff even when it's checkpointing. In particular, with this patch
>> applied, PG will:
>>
>> 1. Absorb fsync requests a lot more often during the sync phase.
>> 2. Still try to run the cleaning scan during the sync phase.
>> 3. Pause for 3 seconds after every fsync.
>>
>
> Yes, the bits you extracted were the remaining useful parts from the
> original patch. Today was quiet here because there were sports on or
> something, and I added full auto-tuning magic to the attached update. I
> need to kick off benchmarks and report back tomorrow to see how well this
> does, but any additional patch here would only be code cleanup on the messy
> stuff I did in here (plus proper implementation of the pair of GUCs). This
> has finally gotten to the exact logic I've been meaning to complete as
> spread sync since the idea was first postponed in 8.3, with the benefit of
> some fsync absorption improvements along the way, too.
>
> The automatic timing is modeled on the existing checkpoint_completion_target
> concept, except with a new tunable (not yet added as a GUC) currently called
> CheckPointSyncTarget, set to 0.8 right now. What I think I want to do is
> make the existing checkpoint_completion_target now be the target for the end
> of the sync phase, matching its name; people who bumped it up won't
> necessarily even have to change anything. Then the new guc can be
> checkpoint_write_target, representing the target that is in there right now.
Is it worth starting a new thread to summarize the different I/O
improvements done so far or on-going, and how we might add new GUCs (if
required!) that work intelligently across those patches? (For instance,
the hint bit I/O limit probably needs a tunable defining something
similar to hint_write_completion_target and/or an I/O throttling
strategy... items which are still in gestation...)
>
> I tossed the earlier idea of counting relations to sync based on the write
> phase data as too inaccurate after testing, and with it for now goes
> checkpoint sorting. Instead, I just take a first pass over pendingOpsTable
> to get a total number of things to sync, which will always match the real
> count barring strange circumstances (like dropping a table).
>
> As for the automatically determining the interval, I take the number of
> syncs that have finished so far, divide by the total, and get a number
> between 0.0 and 1.0 that represents progress on the sync phase. I then use
> the same basic CheckpointWriteDelay logic that is there for spreading writes
> out, except with the new sync target. I realized that if we assume the
> checkpoint writes should have finished in CheckPointCompletionTarget worth
> of time or segments, we can compute a new progress metric with the formula:
>
> progress = CheckPointCompletionTarget + (1.0 - CheckPointCompletionTarget) *
> finished / goal;
>
> Where "finished" is the number of segments written out, while "goal" is the
> total. To turn this into an example, let's say the default parameters are
> set, we've finished the writes, and finished 1 out of 4 syncs; that much
> work will be considered:
>
> progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625
>
> On a scale that effectively aims to have the sync work finished by 0.8.
>
> I don't use quite the same logic as CheckpointWriteDelay, though. It
> turns out the existing checkpoint_completion implementation doesn't always
> work like I thought it did, which provides some very interesting insight into
> why my attempts to work around checkpoint problems haven't worked as well as
> expected the last few years. I thought that what it did was wait until an
> amount of time determined by the target was reached before it did the next
> write. That's not quite it; what it actually does is check progress against
> the target, then sleep exactly one nap interval if it is ahead of
> schedule. That is only the same thing if you have a lot of buffers to write
> relative to the amount of time involved. There's some alternative logic if
> you don't have bgwriter_lru_maxpages set, but in the normal situation it
> effectively means that:
>
> maximum write spread time=bgwriter_delay * checkpoint dirty blocks
>
> No matter how far apart you try to spread the checkpoints. Now, typically,
> when people run into these checkpoint spikes in production, reducing
> shared_buffers improves that. But I now realize that doing so will then
> reduce the average number of dirty blocks participating in the checkpoint,
> and therefore potentially pull the spread down at the same time! Also, if
> you try and tune bgwriter_delay down to get better background cleaning,
> you're also reducing the maximum spread. Between this issue and the bad
> behavior when the fsync queue fills, no wonder this has been so hard to tune
> out of production systems. At some point, the reduction in spread defeats
> further attempts to reduce the size of what's written at checkpoint time, by
> lowering the amount of data involved.
interesting!
>
> What I do instead is nap until just after the planned schedule, then execute
> the sync. What ends up happening then is that there can be a long pause
> between the end of the write phase and when syncs start to happen, which I
> consider a good thing. Gives the kernel a little more time to try and get
> writes moving out to disk.
That sounds like a really good idea.
> Here's what that looks like on my development
> desktop:
>
> 2011-02-07 00:46:24 EST: LOG: checkpoint starting: time
> 2011-02-07 00:48:04 EST: DEBUG: checkpoint sync: estimated segments=10
> 2011-02-07 00:48:24 EST: DEBUG: checkpoint sync: naps=99
> 2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=1
> file=base/16736/16749.1 time=12033.898 msec
> 2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=2
> file=base/16736/16749 time=60.799 msec
> 2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: naps=59
> 2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: number=3
> file=base/16736/16756 time=0.003 msec
> 2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: number=4
> file=base/16736/16750 time=0.003 msec
> 2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: number=5
> file=base/16736/16737 time=0.004 msec
> 2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: number=6
> file=base/16736/16749_fsm time=0.004 msec
> 2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: number=7
> file=base/16736/16740 time=0.003 msec
> 2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: number=8
> file=base/16736/16749_vm time=0.003 msec
> 2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: number=9
> file=base/16736/16752 time=0.003 msec
> 2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: naps=60
> 2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: number=10
> file=base/16736/16754 time=0.003 msec
> 2011-02-07 00:50:12 EST: LOG: checkpoint complete: wrote 14335 buffers
> (43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled;
> write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10,
> longest=12.033 s, average=1.209 s
>
> Since this is ext3 the spike during the first sync is brutal anyway, but it
> tried very hard to avoid that: it waited 99 * 200ms = 19.8 seconds between
> writing the last buffer and when it started syncing them (00:48:04 to
> 00:48:24). Given the slow write for #1, it was then behind, so it
> immediately moved onto #2. But after that, it was able to insert a moderate
> nap time between successive syncs--60 naps is 12 seconds, and it keeps that
> pace for the remainder of the sync. This is the same sort of thing I'd
> worked out as optimal on the system this patch originated from, except it
> had a lot more dirty relations; that's why its naptime was the 3 seconds
> hard-coded into earlier versions of this patch.
>
> Results on XFS with mini-server class hardware should be interesting...
>
> --
> Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support