From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Redesigning checkpoint_segments
Date: 2013-06-06 08:41:26
Message-ID: 51B04B36.5000708@vmware.com
Lists: pgsql-hackers
On 05.06.2013 23:16, Josh Berkus wrote:
>> For limiting the time required to recover after crash,
>> checkpoint_segments is awkward because it's difficult to calculate how
>> long recovery will take, given checkpoint_segments=X. A bulk load can
>> use up segments really fast, and recovery will be fast, while segments
>> full of random deletions can need a lot of random I/O to replay, and
>> take a long time. IMO checkpoint_timeout is a much better way to control
>> that, although it's not perfect either.
>
> This is true, but I don't see that your proposal changes this at all
> (for the better or for the worse).
Right, it doesn't. I explained this to justify that it's OK to replace
checkpoint_segments with max_wal_size. Someone using checkpoint_segments
to limit the time required to recover after a crash might find the
current setting more intuitive than my proposed max_wal_size:
checkpoint_segments means "perform a checkpoint every X segments", so
you know that after a crash you will have to replay at most X segments
(although checkpoint_completion_target already complicates that). With
max_wal_size, the relationship is not as clear.
What I tried to argue is that I don't think that's a serious concern.
>> I propose that we do something similar, but not exactly the same. Let's
>> have a setting, max_wal_size, to control the max. disk space reserved
>> for WAL. Once that's reached (or you get close enough, so that there are
>> still some segments left to consume while the checkpoint runs), a
>> checkpoint is triggered.
>
> Refinement of the proposal:
>
> 1. max_wal_size is a hard limit
I'd like to punt on that until later. Making it a hard limit would be a
much bigger patch, and needs a lot of discussion about how it should
behave (switch to read-only mode, progressively slow down WAL writes, or
what?) and how to implement it.
But I think there's a clear evolution path here: with the current
checkpoint_segments, it's not sensible to treat it as a hard limit.
Once we have something like max_wal_size, defined in MB, it's much more
sensible. So turning it into a hard limit could be a follow-up patch, if
someone wants to step up to the plate.
> 2. checkpointing targets 50% of ( max_wal_size - wal_keep_segments )
> to avoid lockup if checkpoint takes longer than expected.
Will also have to factor in checkpoint_completion_target.
>> Hmm, haven't thought about that. I think a better unit to set
>> wal_keep_segments in would also be MB, not segments.
>
> Well, the ideal unit from the user's point of view is *time*, not space.
> That is, the user wants the master to keep, say, "8 hours of
> transaction logs", not any amount of MB. I don't want to complicate
> this proposal by trying to deliver that, though.
OTOH, if you specify it in terms of time, then you don't have any limit
on the amount of disk space required.
>> In this proposal, the number of segments preallocated is controlled
>> separately from max_wal_size, so that you can set max_wal_size high,
>> without actually consuming that much space in normal operation. It's
>> just a backstop, to avoid completely filling the disk, if there's a
>> sudden burst of activity. The number of segments preallocated is
>> auto-tuned, based on the number of segments used in previous checkpoint
>> cycles.
>
> "based on"; can you give me your algorithmic thinking here? I'm
> thinking we should have some calculation of last cycle size and peak
> cycle size so that bursty workloads aren't compromised.
Yeah, something like that :-). I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:
> /*
>  * Track a moving average of recent buffer allocations.  Here, rather than
>  * a true average we want a fast-attack, slow-decline behavior: we
>  * immediately follow any increase.
>  */
> if (smoothed_alloc <= (float) recent_alloc)
>     smoothed_alloc = recent_alloc;
> else
>     smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
>         smoothing_samples;
>
- Heikki