BUG #14422: Sync replication bottleneck with SetLatch

From: digoal(at)126(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #14422: Sync replication bottleneck with SetLatch
Date: 2016-11-13 08:43:44
Message-ID: 20161113084344.8766.32603@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 14422
Logged by: Zhou Digoal
Email address: digoal(at)126(dot)com
PostgreSQL version: 9.6.1
Operating system: CentOS 6.x x64
Description:

HI,
I test a sync replication in my environment with 4 hosts (32CORE(64HT)
2GB/s net , very good IOPS PCI-E SSDs) .

but i found there is many CPU time consumed by SetLatch when use sync
replication.

this is test case:

```
1. postgresql.conf

listen_addresses = '0.0.0.0'
max_connections = 300
unix_socket_directories = '.'
shared_buffers = 24GB
maintenance_work_mem = 512MB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
bgwriter_flush_after = 0
max_parallel_workers_per_gather = 0
old_snapshot_threshold = -1
backend_flush_after = 0
wal_level = replica
fsync = off
synchronous_commit = remote_write
full_page_writes = off
wal_buffers = 256MB
wal_writer_delay = 10ms
wal_writer_flush_after = 0
checkpoint_timeout = 45min
max_wal_size = 48GB
checkpoint_completion_target = 0.05
checkpoint_flush_after = 0
max_wal_senders = 5
random_page_cost = 1.0
parallel_tuple_cost = 0
parallel_setup_cost = 0
effective_cache_size = 48GB
force_parallel_mode = off
log_destination = 'csvlog'
logging_collector = on
log_truncate_on_rotation = on
log_timezone = 'PRC'
update_process_title = off
autovacuum = on
log_autovacuum_min_duration = -1
autovacuum_max_workers = 16
autovacuum_naptime = 15s
autovacuum_vacuum_scale_factor = 0.02
autovacuum_analyze_scale_factor = 0.01
vacuum_freeze_table_age = 1500000000
vacuum_multixact_freeze_table_age = 1500000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'

2.

pg_ctl -o "-p 1921 -c synchronous_standby_names='1(b,c)'" start -D
/u01/digoal/pg_root1921

3. 2 standby

recovery.conf

recovery_target_timeline = 'latest'
standby_mode=on
primary_conninfo = 'user=postgres host= port= application_name=b'

4. this is stat

-[ RECORD 1 ]----+------------------------------
pid | 42754
usesysid | 10
usename | postgres
application_name | b
client_addr | xxx.xxx.xxx.xxx
client_hostname |
client_port | 52834
backend_start | 2016-11-07 16:07:26.353563+08
backend_xmin |
state | streaming
sent_location | 2/36798458
write_location | 2/36798458
flush_location | 2/36798458
replay_location | 2/36798458
sync_priority | 1
sync_state | sync
-[ RECORD 2 ]----+------------------------------
pid | 42755
usesysid | 10
usename | postgres
application_name | c
client_addr | xxx.xxx.xxx.xxx
client_hostname |
client_port | 60064
backend_start | 2016-11-07 16:07:26.353765+08
backend_xmin |
state | streaming
sent_location | 2/36798458
write_location | 2/36798458
flush_location | 2/36798458
replay_location | 2/36798458
sync_priority | 2
sync_state | potential

test :

psql -c "create table test(id serial primary key, info text, crt_time
timestamp);"

vi test3.sql
insert into test(info) values (null);

pgbench -M prepared -n -r -P 1 -f ./test3.sql -h xxx.xxx.xxx.xxx -p 1921 -U
postgres -c 64 -j 64 -T 120

insert qps ~= 70000/s

1. when test running, master host's top:

top - 16:09:47 up 37 days, 4:11, 1 user, load average: 3.30, 2.94, 2.59
Tasks: 1707 total, 14 running, 1693 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.9%us, 9.4%sy, 0.0%ni, 80.6%id, 0.0%wa, 0.0%hi, 1.0%si,
0.0%st
Mem: 529321832k total, 213706132k used, 315615700k free, 524824k
buffers
Swap: 0k total, 0k used, 0k free, 186220740k cached

42754 digoal 20 0 24.9g 9384 1924 R 49.5 0.0 0:18.55 postgres: wal
sender process postgres xxx.xxx.xxx.xx8(52834)
42755 digoal 20 0 24.9g 9340 1880 D 33.4 0.0 0:14.00 postgres: wal
sender process postgres xxx.xxx.xxx.xx9(60064)
43187 digoal 20 0 24.9g 59m 52m S 19.3 0.0 0:01.22 postgres:
postgres postgres xxx.xxx.xxx.xx2(48185)
43222 digoal 20 0 24.9g 53m 46m S 19.3 0.0 0:01.21 postgres:
postgres postgres xxx.xxx.xxx.xx2(48220)
......

2. when test running, master host's perf top :

PerfTop: 39521 irqs/sec kernel:73.0% exact: 0.0% [1000Hz cycles],
(all, 64 CPUs)
--------------------------------------------------------------------------------------------

samples pcnt function DSO
_______ _____ ___________________________
_____________________________________

59139.00 12.0% __mutex_lock_slowpath [kernel.kallsyms]

44628.00 9.0% _spin_lock [kernel.kallsyms]

37980.00 7.7% mutex_spin_on_owner [kernel.kallsyms]

9815.00 2.0% reverse_path_check_proc [kernel.kallsyms]

9105.00 1.8% find_busiest_group [kernel.kallsyms]

8001.00 1.6% schedule [kernel.kallsyms]

7028.00 1.4% reschedule_interrupt [kernel.kallsyms]

6673.00 1.4% hash_search_with_hash_value
/home/digoal/pgsql9.6/bin/postgres
6181.00 1.3% LWLockAttemptLock
/home/digoal/pgsql9.6/bin/postgres
5330.00 1.1% _spin_lock_irqsave [kernel.kallsyms]

4688.00 1.0% LWLockRelease
/home/digoal/pgsql9.6/bin/postgres
4544.00 0.9% GetSnapshotData
/home/digoal/pgsql9.6/bin/postgres

3. test end, see postgresql master's gmon :

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
70.57 6.33 6.33 WalRcvSigUsr1Handler
12.76 7.48 1.15 9218867 0.00 0.00 SetLatch
# calls = insert(s) 。
2.23 7.68 0.20 1323523 0.00 0.00 SyncRepWakeQueue
1.11 7.78 0.10 841707 0.00 0.00 LWLockRelease
1.00 7.87 0.09 841670 0.00 0.00
SyncRepReleaseWaiters
1.00 7.96 0.09 753520 0.00 0.00 ProcessRepliesIfAny

granularity: each sample hit covers 2 byte(s) for 0.11% of 8.97 seconds

index % time self children called name
<spontaneous>
[1] 70.6 6.33 0.00 WalRcvSigUsr1Handler [1]
-----------------------------------------------
<spontaneous>
[2] 28.3 0.00 2.54 PostgresMain [2]
0.00 2.53 2/2 exec_replication_command
[3]
0.01 0.00 1/1 SendPostmasterSignal [70]
0.00 0.00 1/1 InitPostgres [85]
0.00 0.00 1/1 BeginReportingGUCOptions
[164]
0.00 0.00 2/2 ReadyForQuery [203]
0.00 0.00 1/1 InitProcess [212]
0.00 0.00 1/1 InitDeadLockChecking
[220]
0.00 0.00 3/44 initStringInfo [167]
0.00 0.00 1/18 pq_endmessage [166]
0.00 0.00 2/841672 pq_getmessage [48]
0.00 0.00 2/1595193 pq_startmsgread [36]
0.00 0.00 1/1 InitBufferPoolAccess
[268]
0.00 0.00 2/2 pgstat_report_stat [285]
0.00 0.00 2/2 pgstat_report_activity
[284]
0.00 0.00 2/3
SetCurrentStatementStartTimestamp [273]
0.00 0.00 1/1 InitWalSender [301]
0.00 0.00 1/1 BaseInit [304]
0.00 0.00 1/35 AllocSetContextCreate
[204]
0.00 0.00 1/4120105 AllocSetFree [41]
0.00 0.00 2/34 pq_sendint [242]
0.00 0.00 1/8 MemoryContextDelete <cycle
1> [393]
0.00 0.00 2/2 pq_getbyte [317]
0.00 0.00 2/2 MemoryContextReset [505]
0.00 0.00 2/2
IsAbortedTransactionBlockState [499]
0.00 0.00 2/2 pg_any_to_server [579]
0.00 0.00 2/2 pg_client_to_server [580]
0.00 0.00 2/2 pq_getmsgstring [591]
0.00 0.00 2/2 pq_getmsgend [590]
0.00 0.00 2/2
IsTransactionOrTransactionBlock [503]
0.00 0.00 2/2 ProcessCompletedNotifies
[510]
0.00 0.00 2/3 set_ps_display [482]
0.00 0.00 1/1 process_postgres_switches
[774]
0.00 0.00 1/2 pqinitmask [592]
0.00 0.00 1/3 load_libraries [473]
0.00 0.00 1/1
process_session_preload_libraries [775]
0.00 0.00 1/1 freeBlocks_cmp [753]
0.00 0.00 1/25 pqsignal [361]
0.00 0.00 1/1 WalSndSignals [717]
-----------------------------------------------
0.00 2.53 2/2 PostgresMain [2]
[3] 28.2 0.00 2.53 2 exec_replication_command [3]
0.02 2.51 1/1 WalSndLoop [4]
0.00 0.00 1/1 SyncRepInitConfig [182]
0.00 0.00 2/753466 GetFlushRecPtr [21]
0.00 0.00 3/18 pq_endmessage [166]
0.00 0.00 31/34 pq_sendint [242]
0.00 0.00 1/753524 internal_flush [24]
0.00 0.00 3/44 initStringInfo [167]
0.00 0.00 1/709417 internal_putbytes [38]
0.00 0.00 2/2 replication_scanner_init
[263]
0.00 0.00 7/5675256 appendBinaryStringInfo
[50]
0.00 0.00 2/2 replication_yyparse [279]
0.00 0.00 2/1683364 errstart [57]
0.00 0.00 2/35 AllocSetContextCreate
[204]
0.00 0.00 1/709400 pq_sendbyte [67]
0.00 0.00 1/4120105 AllocSetFree [41]
0.00 0.00 1/8 MemoryContextDelete <cycle
1> [393]
0.00 0.00 4/26 pq_sendstring [359]
0.00 0.00 2/2
SnapBuildClearExportedSnapshot [520]
0.00 0.00 1/3 set_stack_base [483]
0.00 0.00 1/14 RecoveryInProgress [379]
0.00 0.00 1/2 WalSndSetState [525]
-----------------------------------------------
0.02 2.51 1/1 exec_replication_command
[3]
[4] 28.2 0.02 2.51 1 WalSndLoop [4]
0.09 1.89 753520/753520 ProcessRepliesIfAny [5]
0.07 0.13 753464/753464 XLogSendPhysical [9]
0.01 0.12 753457/753457 WaitLatchOrSocket [11]
0.06 0.00 753520/753521 ResetLatch [15]
0.04 0.00 3014018/3014018
pq_cleanup_redirect_to_shm_mq [23]
0.01 0.02 753521/753524 internal_flush [24]
0.02 0.00 753519/753519 WalSndKeepaliveIfNecessary
[37]
0.02 0.00 753520/753520 PostmasterIsAlive [42]
0.01 0.00 753519/753519 WalSndCheckTimeOut [52]
0.00 0.01 753457/753457 WalSndComputeSleeptime
[60]
0.00 0.01 1/1 proc_exit [62]
0.00 0.00 753520/1744598 GetCurrentTimestamp [56]
0.00 0.00 3/44 initStringInfo [167]
0.00 0.00 1/709417 internal_putbytes [38]
0.00 0.00 1/1683364 errstart [57]
0.00 0.00 1/2 WalSndSetState [525]
0.00 0.00 1/3 set_stack_base [483]
-----------------------------------------------
0.09 1.89 753520/753520 WalSndLoop [4]
[5] 22.1 0.09 1.89 753520 ProcessRepliesIfAny [5]
0.09 1.64 841670/841670 SyncRepReleaseWaiters [6]
0.02 0.05 3366680/3366680 pq_getmsgint64 [12]
0.03 0.00 1595190/1595190 pq_getbyte_if_available
[25]
0.02 0.00 1595190/1595193 pq_startmsgread [36]
0.01 0.00 841670/1683309 elog_finish [27]
0.02 0.00 1683340/1683340 pq_getmsgbyte [43]
0.01 0.00 841670/841672 pq_getmessage [48]
0.01 0.00 841670/1683309 elog_start [51]
0.00 0.00 281670/1744598 GetCurrentTimestamp [56]
0.00 0.00 841670/3102136 resetStringInfo [65]
0.00 0.00 186/25758 s_lock [82]
-----------------------------------------------
0.09 1.64 841670/841670 ProcessRepliesIfAny [5]
[6] 19.2 0.09 1.64 841670 SyncRepReleaseWaiters [6]
0.20 1.15 1323523/1323523 SyncRepWakeQueue [7]
0.10 0.05 841624/841707 LWLockRelease [10]
0.00 0.06 841624/841624 SyncRepGetSyncStandbys
[13]
0.01 0.04 841624/841707 LWLockAcquire [17]
0.01 0.00 841624/1683309 elog_finish [27]
0.00 0.00 841624/1683256 list_free_private [40]
0.00 0.00 841624/1683309 elog_start [51]
0.00 0.00 841624/4120105 AllocSetFree [41]
0.00 0.00 1/10 tuplestore_set_eflags
[79]
0.00 0.00 1/4 errmsg [92]
0.00 0.00 163/25758 s_lock [82]
0.00 0.00 1/1683364 errstart [57]
0.00 0.00 841624/4120093 pfree [320]
0.00 0.00 841624/1683255 list_free [321]
-----------------------------------------------
0.20 1.15 1323523/1323523 SyncRepReleaseWaiters [6]
[7] 15.0 0.20 1.15 1323523 SyncRepWakeQueue [7]
1.15 0.00 9218863/9218867 SetLatch [8]
0.00 0.00 9218863/9218876 SHMQueueDelete [319]
-----------------------------------------------
0.00 0.00 1/9218867 handle_sig_alarm [215]
0.00 0.00 1/9218867 ProcKill [213]
0.00 0.00 1/9218867 InitProcess [212]
0.00 0.00 1/9218867 WalSndLastCycleHandler
[214]
1.15 0.00 9218863/9218867 SyncRepWakeQueue [7]
[8] 12.8 1.15 0.00 9218867 SetLatch [8]
-----------------------------------------------
0.07 0.13 753464/753464 WalSndLoop [4]
[9] 2.2 0.07 0.13 753464 XLogSendPhysical [9] //
真正发送的开销只占了2.2 %
0.04 0.00 753464/753466 GetFlushRecPtr [21]
0.02 0.01 2837588/2837588 pq_sendint64 [31]
0.02 0.00 709397/709397 XLogRead [34]
0.02 0.00 709397/709417 internal_putbytes [38]
0.01 0.00 709397/709397 bms_hash_value [49]
0.00 0.00 709397/709400 pq_sendbyte [67]
0.00 0.00 709397/1744598 GetCurrentTimestamp [56]
0.00 0.00 1418794/3102136 resetStringInfo [65]
0.00 0.00 406/25758 s_lock [82]
0.00 0.00 709397/7226325 enlargeStringInfo [296]

3.2 potential wal sender's gmon , no SetLatch out。

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
81.52 6.22 6.22 WalRcvSigUsr1Handler
1.57 6.34 0.12 788548 0.00 0.00 ProcessRepliesIfAny
1.18 6.43 0.09 788495 0.00 0.00 GetFlushRecPtr
1.05 6.51 0.08 897403 0.00 0.00
SyncRepGetSyncStandbys
1.05 6.59 0.08 788488 0.00 0.00 WaitEventSetWait
0.79 6.65 0.06 788549 0.00 0.00 ResetLatch
0.72 6.71 0.06 4378235 0.00 0.00 pfree
```

--------------------------------------------------------------------
this is not a network problem, because i test another case, install 8 KVM vm
host in master, one VM install one postgresql master,

per master also set 2 standbys from other 2 HOSTs, also use sync standby,
test it again.

```
there is no SetLatch lock's problem, and insert qps up from 70000/s to
190000/s.
```

and i also test this case, init 8 postgresql instances in master, and also
set 2 standbys from other 2 HOSTs, also use sync standby, test it again.


but the qps not upgrade.

```
also only 70000/s (8 postgresql's summary)
```

and i also test this case, use cgroup cpuset, limit all 8 postgresql master
instances use only 10 CPU HTs, the performance upgrade to summary 100000/s.


and i also test this case, use cgroup cpuset, , limit 1 postgresql master
instances use only 10 CPU HTs, and test only this instance, the performance
upgrade to 100000/s.

and i also test this case, use cgroup cpuset, , isolate 8 postgresql master
instances , one use 8 CPU HTs, but performance reduce to 70000/s again.

so i think ,this is a problem by SetLatch? OS sync mutex lock?

My company friend test MySQL's semi-sync in the same HOSTs, this insert test
case, MySQL can reach to 220000/s qps.

MySQL use another thread to release sleeped use threads , not the same
thread which send REDO.


best regards.

Browse pgsql-bugs by date

  From Date Subject
Next Message Jari Kopponen 2016-11-14 07:18:03 Re: BUG #14419: DB connections fails with "could not reserve shared memory region" in postgresql log.
Previous Message Julien Rouhaud 2016-11-12 07:50:35 Re: BUG #14419: DB connections fails with "could not reserve shared memory region" in postgresql log.