From: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | bruno(dot)richard(at)broadsign(dot)com |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: PostgreSQL hot standby Hangs due to AccessExclusiveLock on pg_attribute or pg_type tables |
Date: | 2017-07-11 10:01:27 |
Message-ID: | 20170711.190127.230823359.horiguchi.kyotaro@lab.ntt.co.jp |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
Hello.
In conclusion, this doesn't seem to be a bug but just a standby
conflict.
At Mon, 19 Jun 2017 13:29:34 -0400, Bruno Richard <bruno(dot)richard(at)broadsign(dot)com> wrote in <CAB-EU3RawZx8-OzMfvswFf6z+Y7GOZf03TZ=bez+PbQX+A4M7Q(at)mail(dot)gmail(dot)com>
> *Summary:*
> PostgreSQL hot standby Hangs due to AccessExclusiveLock on pg_attribute or
> pg_type tables
>
> *Platform information (all instances)*
> PostgreSQL 9.6.2, ltree extension enabled.
> Ubuntu 12.04 64bits
>
> *Description:*
> We have an issue on our Database system since we migrated from PostgreSQL
> 9.3.2 to 9.6.2.
> The issue is that one (or many) of our hot standby instances does an
> AccessExclusiveLock on the pg_attribute or pg_type system tables that never
> gets unlocked, causing hangs on our systems. We have to restart the
> instance to get out of the lock. It happened ~30 times since end of April.
>
> *Steps to reproduce:*
> Looks like a race condition, happens randomly in production only. I am
> trying to replicate in house.
>
> *Description:*
> This is the output from the pg_locks table when the issue occurs:
>
> -[ RECORD 129 ]----+--------------------
> locktype | relation
> database | 16385
> relation | 1249
> page |
> tuple |
> virtualxid |
> transactionid |
> classid |
> objid |
> objsubid |
> virtualtransaction | 1/0
> pid | 19018
> mode | AccessExclusiveLock
> granted | t
> fastpath | f
I'm not sure why it didn't happen on 9.3, but it is very likely
to be the consequence of a standby conflict. You are setting
max_standby_streaming_delay to 1200 seconds so the lock will be
released after the same duration at maximum.
The lock shown above almost certainly is a vacuum-truncation
lock. Usually the lock is released soon by a following
commit. However, if some other conflicting xlogs (for example
vaccum cleanup records from a concurrent vacuum) comes before the
commit the truncation lock won't be released until the conflict
is resolved. This would be what you are looking.
Changing settings following the documentation will resolve that.
https://www.postgresql.org/docs/9.6/static/hot-standby.html#HOT-STANDBY-CONFLICT
regards,
> -[ RECORD 130 ]----+--------------------
> locktype | relation
> database | 16385
> relation | 1249
> page |
> tuple |
> virtualxid |
> transactionid |
> classid |
> objid |
> objsubid |
> virtualtransaction | 170/47
> pid | 10842
> mode | AccessShareLock
> granted | f
> fastpath | f
>
> The locking PID is the postmaster process.
> When the pg_attribute(oid=1249) table is locked, we cannot login to the
> database, it just hangs. It is possible to login to psql without specifying
> the database.
> When the pg_type(oid=1247) table is locked, we can login to the database,
> but not able to perform queries, it hangs.
> While the processes are hung, CPU usage is low and RAM is normal.
>
> pg_logs will output many of these:
> "Process holding the lock: 19018. Wait queue: 6022, 6023, 6038, 6456, 6466,
> 6467, 6718, 7077, 7079."
>
> We noticed that Autovacuum is executed on those tables on the master just
> before the issue happens:
>
> GMT USER: PID:11218 LOG: automatic vacuum of table
> "bssdb_admin.pg_catalog.pg_attribute": index scans: 1
> pages: 22 removed, 116 remain, 0 skipped due to pins, 0 skipped
> frozen
> tuples: 424 removed, 6097 remain, 0 are dead but not yet removable
> buffer usage: 720 hits, 0 misses, 92 dirtied
> avg read rate: 0.000 MB/s, avg write rate: 7.055 MB/s
> system usage: CPU 0.02s/0.08u sec elapsed 0.10 sec
>
> Since we create some temp tables, autovacuum on pg_attribute is executed
> ~70 times per day, and the autovacuum on pg_type is executed ~20 times per
> day.
> The issue is randomly happening (normally every 2 days).
>
> *Configuration Files:*
> ------
> max_connections = 250
> superuser_reserved_connections = 5
> unix_socket_directories = '/tmp'
>
> shared_buffers = 32GB
> temp_buffers = 8MB
> max_prepared_transactions = 5
> work_mem = 16MB
> maintenance_work_mem = 512MB
> max_stack_depth = 2MB
>
> vacuum_cost_delay = 0
> vacuum_cost_page_hit = 1
> vacuum_cost_page_miss = 10
> vacuum_cost_page_dirty = 20
> vacuum_cost_limit = 200
>
> wal_level = replica
> wal_sync_method = fdatasync
> wal_buffers = 16MB
>
> #checkpoint_timeout = 30min
> max_wal_size = 30GB
> checkpoint_completion_target = 0.90
> max_wal_senders = 30
> wal_keep_segments = 8000
>
> seq_page_cost = 1.0
> random_page_cost = 2.0
> cpu_tuple_cost = 0.5
> cpu_index_tuple_cost = 0.5
>
> effective_cache_size = 120GB
> default_statistics_target = 100
>
> log_destination = 'stderr'
> logging_collector = on
> log_directory = 'pg_log'
> log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
> log_truncate_on_rotation = on
> log_rotation_age = 1d
> log_rotation_size = 0
> client_min_messages = notice
> log_min_messages = notice
> log_error_verbosity = default
> log_min_error_statement = panic
> log_min_duration_statement = 1000ms
> log_hostname = off
> log_line_prefix = '%t'
> log_statement = 'none'
>
> autovacuum = on
> autovacuum_naptime = 1min
> autovacuum_vacuum_threshold = 500
> autovacuum_analyze_threshold = 250
> autovacuum_vacuum_scale_factor = 0.2
> autovacuum_analyze_scale_factor = 0.1
> autovacuum_vacuum_cost_delay = 0
> autovacuum_vacuum_cost_limit = 200
>
> datestyle = 'iso'
> timezone = UTC
> lc_messages = 'en_US.UTF-8'
> lc_monetary = 'en_US.UTF-8'
> lc_numeric = 'en_US.UTF-8'
> lc_time = 'en_US.UTF-8'
>
> log_statement = 'ddl'
> log_line_prefix = '%t USER:%u PID:%p '
> log_connections = on
> log_disconnections = on
> log_lock_waits = on
> log_autovacuum_min_duration = 100ms
> --------
>
> *Configs on hot standby db:*
> --------
> max_connections = 250
> superuser_reserved_connections = 5
> unix_socket_directories = '/tmp'
>
> shared_buffers = 16GB
> temp_buffers = 8MB
> max_prepared_transactions = 5
> work_mem = 16MB
> maintenance_work_mem = 512MB
> max_stack_depth = 2MB
>
> vacuum_cost_delay = 0
> vacuum_cost_page_hit = 1
> vacuum_cost_page_miss = 10
> vacuum_cost_page_dirty = 20
> vacuum_cost_limit = 200
>
> wal_level = replica
> wal_sync_method = fdatasync
> wal_buffers = 16MB
> hot_standby = on
>
> #checkpoint_timeout = 30min
> max_wal_size = 3GB
> checkpoint_completion_target = 0.90
> max_wal_senders = 30
> #wal_keep_segments = 3000
> #max_standby_archive_delay = 1200s
> max_standby_streaming_delay = 1200s
>
> seq_page_cost = 1.0
> random_page_cost = 2.0
> cpu_tuple_cost = 0.5
> cpu_index_tuple_cost = 0.5
>
> effective_cache_size = 55GB
> default_statistics_target = 100
>
> log_destination = 'stderr'
> logging_collector = on
> log_directory = 'pg_log'
> log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
> log_truncate_on_rotation = on
> log_rotation_age = 1d
> log_rotation_size = 0
> client_min_messages = notice
> log_min_messages = notice
> log_error_verbosity = default
> log_min_error_statement = panic
> log_min_duration_statement = 1000ms
> log_hostname = off
> log_line_prefix = '%t'
> log_statement = 'none'
>
> autovacuum = on
> autovacuum_naptime = 1min
> autovacuum_vacuum_threshold = 500
> autovacuum_analyze_threshold = 250
> autovacuum_vacuum_scale_factor = 0.2
> autovacuum_analyze_scale_factor = 0.1
> autovacuum_vacuum_cost_delay = 0
> autovacuum_vacuum_cost_limit = 200
>
> datestyle = 'iso'
> timezone = UTC
> lc_messages = 'en_US.UTF-8'
> lc_monetary = 'en_US.UTF-8'
> lc_numeric = 'en_US.UTF-8'
> lc_time = 'en_US.UTF-8'
>
> log_line_prefix = '%t USER:%u PID:%p '
> log_connections = on
> log_disconnections = on
> log_lock_waits = on
> ------
>
> Thank you
>
> --
> Bruno Richard | Software Architect
> BroadSign International, LLC. | http://broadsign.com/
> T: +1.514.399.1184 | F: +1.514.399.1187
--
Kyotaro Horiguchi
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | jrogers | 2017-07-11 15:11:28 | BUG #14740: PgAdmin 4 Crashing when creating a PgAgent Schedule |
Previous Message | Michał Lis | 2017-07-11 08:00:45 | Re: BUG #14738: ALTER SERVER for foregin servers not working |