Re: checkpoints taking much longer than expected

From: Michael Loftis <mloftis(at)wgops(dot)com>
To: Tiemen Ruiten <t(dot)ruiten(at)tech-lab(dot)io>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: checkpoints taking much longer than expected
Date: 2019-06-16 14:22:55
Message-ID: CAHDg04vppsie+NFpsLfVyrsgz5Qav0T=hLOoH1Z43AqiWz0Skg@mail.gmail.com
Lists: pgsql-general

On Fri, Jun 14, 2019 at 08:02 Tiemen Ruiten <t(dot)ruiten(at)tech-lab(dot)io> wrote:

> Hello,
>
> I setup a new 3-node cluster with the following specifications:
>
> 2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2*20 cores)
> 128 GB RAM
> 8x Crucial MX500 1TB SSD's
>
> FS is ZFS, the dataset with the PGDATA directory on it has the following
> properties (only non-default listed):
>
> NAME        PROPERTY       VALUE  SOURCE
> tank/pgsql  used           234G   -
> tank/pgsql  available     3.28T   -
> tank/pgsql  referenced     234G   -
> tank/pgsql  compressratio  2.68x  -
> tank/pgsql  compression    lz4    inherited from tank
>

You're possibly slowing yourself down a lot with compression here.
Turning compression off is a multi-step process, though: ZFS only applies
the property change to newly written blocks, so you have to disable it on
the dataset and then rewrite/copy the existing data. Before going that
route, I'd check in a scratch area how long it takes to write ~3 GB of
data with compression on vs. off.
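Something like the following would do as a rough comparison (a sketch, not a recipe: it assumes root and a pool named "tank" as in the output above, and the scratch dataset names are made up for illustration):

```shell
#!/bin/sh
# Compare sequential write speed with lz4 compression on vs. off.
# Needs root; creates and destroys two throwaway datasets.
set -e
if ! command -v zfs >/dev/null 2>&1; then
    echo "zfs not found; skipping benchmark" >&2
    exit 0
fi
zfs create -o compression=lz4 tank/scratch-lz4
zfs create -o compression=off tank/scratch-off
# /dev/urandom is incompressible, so this measures lz4's worst-case CPU
# overhead; repeat with copies of real table files to see the effect at
# the dataset's actual 2.68x ratio. dd reports throughput on completion.
dd if=/dev/urandom of=/tank/scratch-lz4/testfile bs=1M count=3072 conv=fsync
dd if=/dev/urandom of=/tank/scratch-off/testfile bs=1M count=3072 conv=fsync
zfs destroy tank/scratch-lz4
zfs destroy tank/scratch-off
```

`conv=fsync` makes dd flush to stable storage before reporting, so the numbers reflect actual disk writes rather than dirty pages in the ARC.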

>
> tank/pgsql  atime     off  inherited from tank
> tank/pgsql  canmount  on   local
> tank/pgsql  xattr     sa   inherited from tank
>
> My postgresql.conf (only changed from default listed):
>
> hba_file = '/var/lib/pgsql/pg_hba.conf'
> listen_addresses = '*'
> max_connections = 800
> shared_buffers = 8GB
> work_mem = 64MB
> maintenance_work_mem = 2GB
> autovacuum_work_mem = 1GB
> dynamic_shared_memory_type = posix
> effective_io_concurrency = 200
> max_worker_processes = 50
> max_parallel_maintenance_workers = 8
> max_parallel_workers_per_gather = 8
> max_parallel_workers = 40
> wal_level = replica
> synchronous_commit = off
> full_page_writes = on
> wal_log_hints = on
> wal_buffers = 128MB
> checkpoint_timeout = 60min
> max_wal_size = 8GB
> min_wal_size = 1GB
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'pgbackrest --stanza=pgdb2 archive-push %p'
> max_wal_senders = 10
> wal_keep_segments = 20
> hot_standby = on
> hot_standby_feedback = on
> random_page_cost = 1.5
> effective_cache_size = 48GB
> default_statistics_target = 500 # range 1-10000
> idle_in_transaction_session_timeout = 30min # in milliseconds, 0 is disabled
> shared_preload_libraries = 'timescaledb, pg_cron'
> max_locks_per_transaction = 512
> timescaledb.max_background_workers = 8
>
> My problem is that checkpoints are taking a long time. Even when I run a
> few manual checkpoints one after the other, they keep taking very long, up
> to 10 minutes:
>
> 2019-06-14 15:21:10.351 CEST [23657] LOG: checkpoint starting: immediate
> force wait
> 2019-06-14 15:25:57.655 CEST [23657] LOG: checkpoint complete: wrote
> 139831 buffers (13.3%); 0 WAL file(s) added, 148 removed, 40 recycled;
> write=284.751 s, sync=0.532 s, total=287.304 s; sync files=537,
> longest=0.010 s, average=0.000 s; distance=2745065 kB, estimate=2841407 kB
> 2019-06-14 15:26:01.988 CEST [23657] LOG: checkpoint starting: immediate
> force wait
> 2019-06-14 15:30:30.430 CEST [23657] LOG: checkpoint complete: wrote
> 238322 buffers (22.7%); 0 WAL file(s) added, 0 removed, 172 recycled;
> write=264.794 s, sync=0.415 s, total=268.441 s; sync files=378,
> longest=0.011 s, average=0.001 s; distance=2834018 kB, estimate=2840668 kB
> 2019-06-14 15:30:44.097 CEST [23657] LOG: checkpoint starting: immediate
> force wait
> 2019-06-14 15:37:01.438 CEST [23657] LOG: checkpoint complete: wrote
> 132286 buffers (12.6%); 0 WAL file(s) added, 54 removed, 96 recycled;
> write=366.614 s, sync=2.975 s, total=377.341 s; sync files=467,
> longest=0.095 s, average=0.006 s; distance=2444291 kB, estimate=2801030 kB
>
> What is going on? This doesn't seem like normal behaviour.
>
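For what it's worth, the log lines above work out to quite low effective write rates, which you can check with a quick back-of-the-envelope calculation (a sketch, assuming the default 8 kB buffer/page size):

```python
# Effective checkpoint write rate from the quoted log lines.
BLOCK = 8192  # default PostgreSQL page size in bytes

checkpoints = [
    (139831, 284.751),  # (buffers written, write time in seconds)
    (238322, 264.794),
    (132286, 366.614),
]

for buffers, secs in checkpoints:
    mb = buffers * BLOCK / (1024 * 1024)
    print(f"{mb:7.1f} MB in {secs:7.1f} s = {mb / secs:5.2f} MB/s")
```

Single-digit MB/s sustained writes on an 8-SSD pool is far below what the hardware should manage, which is what points at something in the write path (compression being one candidate) rather than at PostgreSQL itself.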
--

"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler
