Re: Strange checkpoint behavior - checkpoints take a long time

From: Igor Polishchuk <ora4dba(at)gmail(dot)com>
To: Vladimir Nicolici <vladnc(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Strange checkpoint behavior - checkpoints take a long time
Date: 2017-10-05 23:56:34
Message-ID: F54CFC93-29CF-4F87-9A76-B2E398AFBC2F@gmail.com
Lists: pgsql-general

Vladimir,
Just curious: if your goal is to reduce checkpoint overhead, shouldn’t you decrease shared_buffers instead of increasing it?
With bigger shared_buffers, you can accumulate more dirty buffers for the checkpoint to take care of.
I remember that in early versions (around 8.4), when checkpoint_completion_target was not available, one suggested way of fighting heavy checkpoints was setting a very low shared_buffers.
Also, why do you need to reduce your checkpoint write time from 25 minutes to 20? What will you gain? If you have the same number of dirty buffers to flush, your I/O intensity and overhead will increase during those 20 minutes.
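
If it helps, a rough way to see how the writes are currently split between checkpoints, the background writer, and the backends is pg_stat_bgwriter; something like this (column names as of 9.6/10):

SELECT checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,  -- buffers written during checkpoints
       buffers_clean,       -- buffers written by the background writer
       buffers_backend,     -- buffers written directly by backends
       maxwritten_clean     -- cleaning scans stopped by bgwriter_lru_maxpages
FROM pg_stat_bgwriter;

If checkpoints are already doing most of the writing, a smaller shared_buffers would likely just shift some of that work to the background writer and the backends.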

Igor


> On Oct 5, 2017, at 12:58, Vladimir Nicolici <vladnc(at)gmail(dot)com> wrote:
>
> Some further updates about the issue.
>
> I did a bit of benchmarking on the disk system with iozone, and during the test the SSDs seemed to be able to easily sustain 200 MB/second of writes each; they fluctuated between 200 MB/s and 400 MB/s when doing 96 GB of random writes in a file. That would mean between 400 and 800 MB/s for the entire RAID volume, since it’s 1+0 and has 4 SSDs, 2 in each mirror.
>
> I wasn’t able to benchmark reads properly because the machine has so much RAM that reads are served mostly from cache. But reads shouldn’t be an issue anyway. I’ll try to do more comprehensive tests tomorrow but, from what I’ve seen so far, SSD I/O limits shouldn’t be a concern.
>
> I changed some configuration parameters during the night to the values I was considering yesterday:
>
> shared_buffers = 144GB #previously 96 GB
> bgwriter_lru_maxpages = 100 #previously 400
> checkpoint_timeout = 30min #previously 5min
> checkpoint_completion_target = 0.83 #previously 0.85; 0.83 means 25 minutes of writes out of the 30 minutes.
> max_wal_size = 96GB #previously 16GB
> wal_buffers = 16MB #previously 32 MB
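>
> A quick way to double-check which of these are actually live and which still need a restart (shared_buffers in particular) is pg_settings; a rough sketch, on 9.5 or newer:
>
> SELECT name, setting, unit, pending_restart
> FROM pg_settings
> WHERE name IN ('shared_buffers', 'bgwriter_lru_maxpages', 'checkpoint_timeout',
>                'checkpoint_completion_target', 'max_wal_size', 'wal_buffers');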
>
> With the new settings the checkpoints now finish on time, more or less. One recent checkpoint looked like this:
>
> 2017-10-05 14:16:22.891 EDT [7828] LOG: checkpoint starting: time
> 2017-10-05 14:42:35.429 EDT [7828] LOG: checkpoint complete: wrote 4770679 buffers (25.3%); 0 transaction log file(s) added, 0 removed, 2088 recycled; write=1501.567 s, sync=1.844 s, total=1572.538 s; sync files=750, longest=0.029 s, average=0.002 s; distance=33579763 kB, estimate=35950738 kB
>
> So it took 1502 seconds for writes, instead of the configured 1494; that’s close enough, about 8 seconds more. Sync was another 2 seconds, so roughly 1504 seconds of write + sync. However, the total is 1573 seconds, 69 seconds more. Not sure what the checkpoint process does in those 69 seconds.
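>
> Just to make the arithmetic explicit (1494 is 83% of the 30 minute timeout, and the ~69 seconds is whatever the total includes beyond write + sync):
>
> SELECT 30 * 60 * 0.83                 AS configured_write_seconds,  -- 1494
>        1572.538 - (1501.567 + 1.844)  AS unaccounted_seconds;       -- ~69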
>
> Looking further back at today’s checkpoints in the log, the first checkpoint of the day that did any significant work didn’t have such a large difference between write time and total time; it had write + sync = 1494 and a total of 1495, just 1 second apart:
>
> 2017-10-05 09:16:22.851 EDT [7828] LOG: checkpoint starting: time
> 2017-10-05 09:41:18.139 EDT [7828] LOG: checkpoint complete: wrote 4445573 buffers (23.6%); 0 transaction log file(s) added, 0 removed, 98 recycled; write=1493.322 s, sync=0.575 s, total=1495.287 s; sync files=734, longest=0.022 s, average=0.000 s; distance=29166105 kB, estimate=29166105 kB
>
> The difference is that it didn’t need to recycle as many WAL segments as the later one. Not sure what exactly the recycling does, if anything, but it seems to add some overhead to the process. The checkpoint with the 69 second delay recycled 2088 WAL segments; the one with the 1 second delay recycled just 98.
>
> If the overhead is indeed caused by recycling, I’m not sure whether it’s proportional to the size of the recycled segments or to their number. If it’s the number of segments, then maybe a WAL segment size larger than 16 MB would help with this issue, but unfortunately, for the time being, that can only be configured at compile time and forces you to recreate the cluster as well.
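>
> As a rough cross-check, assuming the default 16 MB (16384 kB) segment size, the logged distance of the slow checkpoint is in the same ballpark as the number of segments it recycled:
>
> SELECT 33579763 / 16384.0 AS segments_in_distance;  -- ~2050, vs 2088 recycled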
>
> For tomorrow I plan to reduce bgwriter_lru_maxpages from 100 to 0 and rely just on checkpoints for data file writes, and to reduce checkpoint_completion_target from 0.83 to 0.66 to try to make the checkpoints finish the writes in 20 minutes instead of 25, out of the 30 minutes.
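>
> For the record, both of those can be changed with just a reload, no restart; a sketch, assuming ALTER SYSTEM rather than editing postgresql.conf directly:
>
> ALTER SYSTEM SET bgwriter_lru_maxpages = 0;            -- disables background writer LRU cleaning
> ALTER SYSTEM SET checkpoint_completion_target = 0.66;  -- ~20 minutes of writes per 30 minute checkpoint
> SELECT pg_reload_conf();                               -- picks up both settings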
>
> Thanks,
> Vlad
