LwLocks contention (MultiXactOffsetControlLock/multixact_offset) when running logical replication initial snapshot

From: Marko Sutic <marko(dot)sutic(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: LwLocks contention (MultiXactOffsetControlLock/multixact_offset) when running logical replication initial snapshot
Date: 2024-04-08 12:48:33
Message-ID: CAMD6WPd77S67F41YXuyJbVr-6SiD_2fm2scNS3mZhG7577E10g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,
We are currently using a shared PostgreSQL cluster (version 11.18) that
supports over ten databases. To alleviate the load on this cluster, we've
decided to migrate certain databases to dedicated clusters using the native
logical replication feature. This approach has been successfully applied to
between 50 and 100 databases without issues. However, we have recently
encountered an issue related to LWLocks contention.

The problem happened during the taking of an initial snapshot of a slightly
bigger database, approximately 500GB, with a single table accounting for
300GB. Although the database remained operational, its performance degraded
significantly for some services. Threads experienced delays of 20-30
seconds per simple execution when waiting for the
“LWLock:MultiXactOffsetControlLock” and “multixact_offset” locks, which
also blocked other processes. This issue did not happen immediately but
after a few hours running initial snapshot creation required for logical
replication.

Interestingly, not all databases or queries were impacted. The performance
degradation primarily affected specific queries, which I've listed below
with anonymized table names for confidentiality:

Database "migrated_db":
Insert Query: INSERT INTO library_books (author_id, genre_id, book_id,
publisher, library_id, section_key, content) VALUES ($1, $2, $3, $4, $5,
$6, $7);
Select Query: SELECT $2 FROM ONLY "academic_records"."lecture_series" x
WHERE "professor_id" = $1 FOR KEY SHARE OF x;

Database "other_db":
Update Query: UPDATE "vehicle_registry" SET "mileage_count" = mileage_count
+ $1 WHERE "vehicle_id" = $2 RETURNING "mileage_count";

These queries experienced significant increases in execution time and
shared buffer reads per call. The "library_books" table swelled from 500KB
to nearly 800MB, showing increased bloat and the oldest row age. Noticeable
drop in transaction rate was visible for affected services.
Upon discontinuing the replication, the locks were released, and the
"library_books" table returned to its original size of 500KB, with
performance levels improving correspondingly.

Could you please provide insights on how the initial snapshot for logical
replication could be causing these LWLocks contention issues? Furthermore,
why are only certain queries affected, including some from non-migrated
databases?
Would initiating the snapshot with pg_dump, reducing or temporarily
removing the workload on the affected queries, or making certain parameter
adjustments help resolve this issue?

Thank you for your assistance and insights.

Best regards,
Marko

Browse pgsql-general by date

  From Date Subject
Next Message Ron Johnson 2024-04-08 13:37:07 What is referential_action?
Previous Message Ayush Vatsa 2024-04-07 18:22:47 Re: Clarification on View Privileges and Operator Execution in PostgreSQL