From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Redesigning checkpoint_segments
Date: 2013-06-06 08:41:26
Message-ID: 51B04B36.5000708@vmware.com
Lists: pgsql-hackers
On 05.06.2013 23:16, Josh Berkus wrote:
>> For limiting the time required to recover after crash,
>> checkpoint_segments is awkward because it's difficult to calculate how
>> long recovery will take, given checkpoint_segments=X. A bulk load can
>> use up segments really fast, and recovery will be fast, while segments
>> full of random deletions can need a lot of random I/O to replay, and
>> take a long time. IMO checkpoint_timeout is a much better way to control
>> that, although it's not perfect either.
>
> This is true, but I don't see that your proposal changes this at all
> (for the better or for the worse).
Right, it doesn't. I explained this to justify that it's OK to replace
checkpoint_segments with max_wal_size. Someone using checkpoint_segments
to limit the time required to recover after a crash might find the
current setting more intuitive than my proposed max_wal_size:
checkpoint_segments means "perform a checkpoint every X segments", so
you know that after a crash you will have to replay at most X segments
(although checkpoint_completion_target already complicates that). With
max_wal_size, the relationship is not as clear.
What I tried to argue is that I don't think that's a serious concern.
>> I propose that we do something similar, but not exactly the same. Let's
>> have a setting, max_wal_size, to control the max. disk space reserved
>> for WAL. Once that's reached (or you get close enough, so that there are
>> still some segments left to consume while the checkpoint runs), a
>> checkpoint is triggered.
>
> Refinement of the proposal:
>
> 1. max_wal_size is a hard limit
I'd like to punt on that until later. Making it a hard limit would be a
much bigger patch, and needs a lot of discussion about how it should
behave (switch to read-only mode, progressively slow down WAL writes, or
what?) and how to implement it.
But I think there's a clear evolution path here: with the current
checkpoint_segments, it's not sensible to treat it as a hard limit.
Once we have something like max_wal_size, defined in MB, it's much more
sensible. So turning it into a hard limit could be a follow-up patch, if
someone wants to step up to the plate.
> 2. checkpointing targets 50% of ( max_wal_size - wal_keep_segments )
> to avoid lockup if checkpoint takes longer than expected.
Will also have to factor in checkpoint_completion_target.
>> Hmm, haven't thought about that. I think a better unit to set
>> wal_keep_segments in would also be MB, not segments.
>
> Well, the ideal unit from the user's point of view is *time*, not space.
> That is, the user wants the master to keep, say, "8 hours of
> transaction logs", not any amount of MB. I don't want to complicate
> this proposal by trying to deliver that, though.
OTOH, if you specify it in terms of time, then you don't have any limit
on the amount of disk space required.
>> In this proposal, the number of segments preallocated is controlled
>> separately from max_wal_size, so that you can set max_wal_size high,
>> without actually consuming that much space in normal operation. It's
>> just a backstop, to avoid completely filling the disk, if there's a
>> sudden burst of activity. The number of segments preallocated is
>> auto-tuned, based on the number of segments used in previous checkpoint
>> cycles.
>
> "based on"; can you give me your algorithmic thinking here? I'm
> thinking we should have some calculation of last cycle size and peak
> cycle size so that bursty workloads aren't compromised.
Yeah, something like that :-). I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:
> /*
>  * Track a moving average of recent buffer allocations.  Here, rather than
>  * a true average we want a fast-attack, slow-decline behavior: we
>  * immediately follow any increase.
>  */
> if (smoothed_alloc <= (float) recent_alloc)
>     smoothed_alloc = recent_alloc;
> else
>     smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
>         smoothing_samples;
>
- Heikki