From: Joshua Banton <bantonj(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: High System CPU Usage on Selects Seemingly Caused By Vacuum of Same Table
Date: 2025-01-31 22:28:16
Message-ID: CAHkYM9bmCEq8J=tUMYr6HRNL03ddUX3JUPCKY6frngBCbuPrGw@mail.gmail.com
Lists: pgsql-performance
Hello,
We are experiencing a high system CPU issue with our database that we can't
quite explain.
Description of the issue:
The database instance CPU usage suddenly climbs to over 90%, with the majority
of it reported by RDS Enhanced Monitoring as system CPU rather than coming from
Postgres itself. IOPS usage barely changes and has plenty of headroom.
The issue mostly manifests near the end of the "scanning heap" phase of
vacuuming one of our largest tables, which we'll call table1. RDS Performance
Insights reports that selects on table1 start to wait on CPU, whereas
previously they didn't even show up in the top 25 queries by wait. It doesn't
always happen, but it is more likely to happen when there is a larger than
usual number of selects on table1.
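A query along these lines against pg_stat_progress_vacuum can be used to watch
which phase the vacuum is in and how much of the heap remains to scan (just a
sketch; table1 stands in for the real relation name, and progress is reported
in blocks, so it is converted to bytes via the block size):

-- Watch vacuum progress on table1: current phase plus heap blocks scanned.
SELECT p.pid,
       c.relname,
       p.phase,
       p.heap_blks_total,
       p.heap_blks_scanned,
       pg_size_pretty((p.heap_blks_total - p.heap_blks_scanned)
                      * current_setting('block_size')::bigint) AS remaining_to_scan
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid
WHERE c.relname = 'table1';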
Example Timeline:
00:00 vacuum starts on table1 (which has at least 10 million dead tuples as
reported by pg_stat_user_tables; see the query sketch after this timeline)
02:00 database CPU p90 12%, RDS Performance Insights CPU waits 11%, IOPS
under 100k, table1 not in top 25 waits in RDS Performance Insights, vacuum
reports 6GB out of 3TB remaining to scan
02:01 database CPU p90 98%, RDS Performance Insights CPU waits 73%, IOPS
under 100k, table1 selects now #1, #2, and #6 in RDS Performance Insights,
making up the majority of the CPU waits
02:42 vacuum on table1 finishes, database CPU p90 11%, RDS Performance
Insights CPU waits 12%, IOPS under 100k, table1 not in top 25 waits in RDS
Performance Insights
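The dead-tuple figure at 00:00 comes from pg_stat_user_tables; a query of
roughly this shape shows it alongside the vacuum history (a sketch, with
table1 again standing in for the real name):

-- Dead-tuple estimate and vacuum history for table1. These are estimates
-- maintained by the statistics system, not exact counts.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'table1';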
Installation Details:
AWS RDS Managed Instance
Postgres Version: PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by gcc
(GCC) 7.3.1 20180712 (Red Hat 7.3.1-17), 64-bit
Instance Class: r7i.48xlarge
vCPU: 192
RAM: 1536 GB
Storage Type: Provisioned IOPS SSD (io2), 180000 IOPS
Application Language/Framework: Ruby on Rails
Pooler: PgBouncer 1.17.0
Config:
|name |current_setting
|-------------------------------------------|---------------------------------------
|autovacuum_analyze_scale_factor |0.005
|autovacuum_freeze_max_age |500000000
|autovacuum_max_workers |24
|autovacuum_naptime |15s
|autovacuum_vacuum_cost_delay |1ms
|autovacuum_vacuum_cost_limit |2000
|autovacuum_vacuum_scale_factor |0.01
|autovacuum_work_mem |48756174kB
|checkpoint_completion_target |0.9
|checkpoint_timeout |1h
|default_toast_compression |lz4
|effective_cache_size |780098792kB
|enable_presorted_aggregate |on
|fsync |on
|full_page_writes |on
|gss_accept_delegation |off
|hot_standby |off
|huge_pages |on
|icu_validation_level |error
|idle_in_transaction_session_timeout |30s
|ignore_invalid_pages |off
|jit |off
|lo_compat_privileges |off
|log_autovacuum_min_duration |0
|log_checkpoints |on
|log_connections |on
|log_disconnections |on
|log_min_duration_statement |10ms
|log_rotation_age |1h
|log_temp_files |0
|log_timezone |UTC
|log_truncate_on_rotation |off
|logging_collector |on
|maintenance_work_mem |25577009kB
|max_connections |5000
|max_locks_per_transaction |64
|max_logical_replication_workers |10
|max_parallel_apply_workers_per_subscription|2
|max_parallel_workers |96
|max_prepared_transactions |0
|max_replication_slots |10
|max_slot_wal_keep_size |400000MB
|max_stack_depth |6MB
|max_standby_archive_delay |-1
|max_standby_streaming_delay |2147483647ms
|max_wal_senders |25
|max_wal_size |200000MB
|max_worker_processes |20
|min_wal_size |128MB
|random_page_cost |1.05
|recovery_init_sync_method |syncfs
|recovery_prefetch |off
|remove_temp_files_after_crash |off
|reserved_connections |2
|scram_iterations |4096
|send_abort_for_crash |off
|send_abort_for_kill |off
|session_preload_libraries |
|shared_buffers |390049392kB
|shared_preload_libraries                   |rdsutils,pg_tle,pg_stat_statements
|synchronous_commit |on
|track_activity_query_size |4kB
|track_functions |pl
|track_io_timing |on
|track_wal_io_timing |on
|vacuum_buffer_usage_limit |399410kB
|vacuum_cost_page_miss |5
|vacuum_failsafe_age |1200000000
|vacuum_multixact_failsafe_age |1200000000
|wal_compression |lz4
|wal_keep_size |400000MB
|wal_level |logical
|wal_receiver_create_temp_slot |off
|wal_receiver_timeout |15min
|wal_sender_timeout |2min
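For reference, a listing in roughly the shape above can be pulled from
pg_settings with something like the following (a sketch; it assumes only
settings changed from their built-in defaults are wanted):

-- List settings that differ from their built-in defaults.
SELECT name, current_setting(name) AS current_setting
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;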
Large Table Stats:
table1:
rows: 2,303,386,400
disk space: 3.5T
rel size: 3T
FSM size: 758 MB
visibility map size: 94 MB
table2:
rows: 1,919,360,500
disk space: 1.6T
rel size: 758G
FSM size: 189 MB
visibility map size: 24 MB
table3:
rows: 2,412,903,000
disk space: 741G
rel size: 387G
FSM size: 96 MB
visibility map size: 12 MB
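Per-table figures like the ones above can be gathered with the standard size
functions, along these lines (a sketch; the relation names are placeholders):

-- Approximate row count, total on-disk size (including indexes and TOAST),
-- main-fork size, and the free space map / visibility map fork sizes.
SELECT c.relname,
       c.reltuples::bigint                            AS approx_rows,
       pg_size_pretty(pg_total_relation_size(c.oid))  AS disk_space,
       pg_size_pretty(pg_relation_size(c.oid))        AS rel_size,
       pg_size_pretty(pg_relation_size(c.oid, 'fsm')) AS fsm_size,
       pg_size_pretty(pg_relation_size(c.oid, 'vm'))  AS vm_size
FROM pg_class c
WHERE c.relname IN ('table1', 'table2', 'table3');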
Any suggestions or insights greatly appreciated!
Joshua Banton