Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica

From: Rui An <rueian(dot)huang(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica
Date: 2021-07-31 19:30:14
Message-ID: CABjTVMR39ci6LUVE2yWhpcufPee4nO3cu32sij3Yh8viksCztQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi, I have recently encountered the problem, as title, on some of my
PostgreSQL 9.6.1 hot standby replicas when serving readonly queries.

The problem makes a lot of PostgreSQL processes stuck on waiting
for MultiXactOffsetControlLock and eventually runs out of
the max_connections quota.

For example:

> mydb=# select count(*) from pg_stat_activity where state = 'active' and
> wait_event = 'MultiXactOffsetControlLock';
> count
> ------
> 956
> (1 row)

I have tried use pg_terminate_backend to kill the stuck prcoesses, but no
help. They still stuck on waiting for MultiXactOffsetControlLock even if
the pg_terminate_backend(pid) returns true.

Currently, the only thing I can do to resolve the problem is to restart my
replicas.

Can someone help me find out how could the problem occurred and is there
any other way to resolve the problem?

I have also recorded some strace and perf top output on stuck processes:

strace:

> strace: Process 555 attached
> select(0, NULL, NULL, NULL, {0, 78} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> semop(60981276, [{0, -1, 0}, 1]) = 0
> semop(60981276, [{0, -1, 0}, 1]) = 0
> semop(61734963, [{14, -1, 0}, 1]) = 0
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> semop(61505580, [{12, -1, 0}, 1]) = 0
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> semop(60981276, [{0, -1, 0}, 1]) = 0
> semop(63471720, [{9, -1, 0}, 1]) = 0
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> semop(60981276, [{0, -1, 0}, 1]) = 0
> semop(63471720, [{9, -1, 0}, 1]) = 0
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
> semop(60981276, [{0, -1, 0}, 1]) = 0

perf top:

> Samples: 208 of event 'cpu-clock', Event count (approx.): 36045672
> Overhead Shared Object Symbol
> 30.74% postgres [.] perform_spin_delay
> 13.70% [kernel] [k] finish_task_switch
> 13.66% [kernel] [k] 0x00007fff81838f35
> 8.59% postgres [.] 0x0000000000319107
> 4.02% postgres [.] 0x0000000000318f04
> 2.47% postgres [.] pg_usleep
> 2.10% libc-2.23.so [.] __select
> 1.68% postgres [.] PGSemaphoreLock
> 1.68% postgres [.] 0x0000000000318eb2
> 1.54% [kernel] [k] pvclock_clocksource_read
> 1.53% postgres [.] GetMultiXactIdMembers
> 1.41% postgres [.] 0x0000000000318ea4
> 1.31% postgres [.] LWLockAcquire
> 1.30% postgres [.] 0x0000000000318e17
> 1.23% [kernel] [k] do_select
> 1.07% postgres [.] finish_spin_delay
> 1.06% [kernel] [k] poll_freewait
> 1.05% libc-2.23.so [.] semop
> 0.96% [kernel] [k] poll_schedule_timeout
> 0.72% [kernel] [k] SYSC_semtimedop
> 0.69% [kernel] [k] core_sys_select
> 0.69% [kernel] [k] pid_vnr
> 0.61% [kernel] [k] dev_gro_receive
> 0.61% libc-2.23.so [.] random
> 0.61% postgres [.] 0x0000000000318e3d
> 0.53% [kernel] [k] hrtimer_active
> 0.53% [kernel] [k] ipc_obtain_object_check
> 0.46% postgres [.] 0x0000000000318f00
> 0.46% postgres [.] 0x0000000000319100
> 0.46% postgres [.] 0x0000000000319103
> 0.41% postgres [.] LWLockRelease
> 0.41% postgres [.] 0x0000000000318f85
> 0.36% [kernel] [k] run_timer_softirq
> 0.36% [kernel] [k] sys_semop

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andrey Borodin 2021-08-02 04:57:03 Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica
Previous Message Peter Geoghegan 2021-07-31 08:51:44 Re: CREATE INDEX CONCURRENTLY does not index prepared xact's data