Re: Excessive number of replication slots for 12->14 logical replication

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Ajin Cherian <itsajin(at)gmail(dot)com>
Cc: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Hubert Lubaczewski <depesz(at)depesz(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Excessive number of replication slots for 12->14 logical replication
Date: 2022-07-29 09:36:54
Message-ID: CAHut+Pvt9G8Z8DDE0627=GZJFWC6jiF_uymmXX-RzDojQTw2hw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Here are some comments for v3-0001:

(all cosmetic / comments)

======

0. <apply>

There were multiple whitespace warnings reported by git apply.

[postgres(at)CentOS7-x64 oss_postgres_misc]$ git apply
../patches_misc/v3-0001-fix-excessive-replicating-origin-slots-issue.patch
../patches_misc/v3-0001-fix-excessive-replicating-origin-slots-issue.patch:36:
indent with spaces.
/*
../patches_misc/v3-0001-fix-excessive-replicating-origin-slots-issue.patch:37:
indent with spaces.
* Cleanup the tablesync slot and the origin tracking if exists.
../patches_misc/v3-0001-fix-excessive-replicating-origin-slots-issue.patch:38:
indent with spaces.
*
../patches_misc/v3-0001-fix-excessive-replicating-origin-slots-issue.patch:39:
indent with spaces.
* This has to be done after updating the state because otherwise if
../patches_misc/v3-0001-fix-excessive-replicating-origin-slots-issue.patch:40:
indent with spaces.
* there is an error while doing the database operations we won't be
warning: squelched 3 whitespace errors
warning: 8 lines add whitespace errors.

======

1. Commit message

I think the Replication Origin tracking is maybe not strictly speaking
a "slot", even though it does use the same GUC as true slots. With
that in mind, perhaps the following text is more accurate.

SUGGESTION
The replication origin tracking uses the same GUC
(max_replication_slots) as the tablesync slots for limiting resources.

In the current (HEAD) code the replication origin tracking of the
completed tablesync worker is dropped by the apply worker, but this
means there can be a small lag between when the tablesync worker
exited, and when its origin tracking is actually dropped.

Meanwhile, new tablesync workers can be launched and will immediately
try to acquire new slots and origin tracking. If this happens before
the worker's origin tracking gets dropped then the available number of
slots (max_replication_slots) can be exceeded, leading to the error as
reported.

To avoid this lag, the dropping of replicating origin tracking is
moved to the tablesync worker where it exits.

======

2. src/backend/replication/logical/tablesync.c - process_syncing_tables_for_sync

@@ -315,13 +316,43 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
*/
walrcv_endstreaming(LogRepWorkerWalRcvConn, &tli);

+ /*
+ * Cleanup the tablesync slot and the origin tracking if exists.
+ *

Consider reversing that comment's first sentence so that it describes
what it will do in the same order as the code logic.

SUGGESTION
Cleanup the origin tracking and tablesync slot.

~~~

3.

+ /*
+ * Reset the origin session before dropping.
*
- * This has to be done after updating the state because otherwise if
- * there is an error while doing the database operations we won't be
- * able to rollback dropped slot.
+ * This is required to reset the ownership of the slot
+ * and allow the slot to be dropped.
*/

"slot to be dropped" -> "origin to be dropped" (maybe?)

~~~

4. src/backend/replication/logical/tablesync.c -
process_syncing_tables_for_apply

@@ -451,27 +480,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
started_tx = true;
}

- /*
- * Remove the tablesync origin tracking if exists.
- *
- * The normal case origin drop is done here instead of in the
- * process_syncing_tables_for_sync function because we don't
- * allow to drop the origin till the process owning the origin
- * is alive.
- *
- * There is a chance that the user is concurrently performing
- * refresh for the subscription where we remove the table
- * state and its origin and by this time the origin might be
- * already removed. So passing missing_ok = true.
- */
- ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
- rstate->relid,
- originname,
- sizeof(originname));
- replorigin_drop_by_name(originname, true, false);

/*

This change results in a double blank line remaining instead of just a
single blank line.

------
Kind Regards,
Peter Smith
Fujitsu Australia

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Francisco Olarte 2022-07-29 09:56:13 Re: BUG #17561: Server crashes on executing row() with very long argument list
Previous Message PG Bug reporting form 2022-07-29 09:14:12 BUG #17561: Server crashes on executing row() with very long argument list