Re: Strange checkpoint behavior - checkpoints take a long time

From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Strange checkpoint behavior - checkpoints take a long time
Date: 2017-10-10 14:17:23
Message-ID: d679178f-1c09-e136-f0a2-cedd98f65515@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello Vladimir,

maybe your update triggered auto_vacuum on those tables ? Default autovacuum_freeze_max_age is exactly set at 200,000,000 . Did you check your vacuum stats afterwards (pg_stat_*_tables) ?
Can you show the code which performed the deletes?

On 10/10/2017 16:56, Vladimir Nicolici wrote:
>
> I experimented some more with the settings this weekend, while doing some large write operations (deleting 200 million records from a table), and I realized that the database is capable of
> generating much more WAL than I estimated.
>
> And it seems that spikes in write activity, when longer than a few minutes, can cause the checkpoint process to “panic” and start a checkpoint earlier, and trying to complete it as soon as possible,
> estimating, correctly, that if that level of activity continues it will hit the max_wal_size limit.
>
> Based on that, I reduced the checkpoint_timeout from 30 minutes to 20 minutes, while keeping max_wal_size at 144GB . Alternatively I could have increased the maximum WAL size more, but I’m not sure
> it’s a good idea to set it higher than the shared buffers, which are also set at 144GB. After this change, on Monday all checkpoints were triggered by “time”, I didn’t have any more checkpoints
> triggered by “xlog”.
>
> I also set checkpoint_completion_target to 0.5 to see if our hardware can handle concentrating the write activity for 20 minutes in just 10 minutes, and that worked very well too, checkpoints
> finished on time. The %util (busy%) for the SSDs as reported by sar was around 20% when not doing a checkpoint, and 60% during the checkpoint, so it seems the hardware will be able to handle future
> increases in activity just fine.
>
> The lesson I learned here is that max_wal_size needs to be configured based on the **maximum** volume of wal the database can generate in the checkpoint_timeout interval. Initially I had it set
> based on the **average** volume of wal generated in that interval, setting it to 3 times that average, but that was not enough, triggering the unexpected behavior.
>
> Thanks,
>
> Vlad
>

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Vladimir Nicolici 2017-10-10 14:30:33 Re: Strange checkpoint behavior - checkpoints take a longtime
Previous Message Melvin Davidson 2017-10-10 14:03:29 Fwd: [GENERAL] How to sort in pgAdmin 4