Re: speed up a logical replica setup

From: vignesh C <vignesh21(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Euler Taveira <euler(at)eulerto(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
Subject: Re: speed up a logical replica setup
Date: 2024-02-20 13:34:57
Message-ID: CALDaNm3Q5W=EvphDjHA1n8ii5fv2DvxVShSmQLNFgeiHsOUwPg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 19 Feb 2024 at 11:15, Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear hackers,
>
> > Since it may be useful, I will post top-up patch on Monday, if there are no
> > updating.
>
> And here are top-up patches. Feel free to check and include.
>
> v22-0001: Same as v21-0001.
> === rebased patches ===
> v22-0002: Update docs per recent changes. Same as v20-0002.
> v22-0003: Add check versions of the target. Extracted from v20-0003.
> v22-0004: Remove -S option. Mostly same as v20-0009, but commit massage was
> slightly changed.
> === Newbie ===
> V22-0005: Addressed my comments which seems to be trivial[1].
> Comments #1, 3, 4, 8, 10, 14, 17 were addressed here.
> v22-0006: Consider the scenario when commands are failed after the recovery.
> drop_subscription() is removed and some messages are added per [2].
> V22-0007: Revise server_is_in_recovery() per [1]. Comments #5, 6, 7, were addressed here.
> V22-0008: Fix a strange report when physical_primary_slot is null. Per comment #9 [1].
> V22-0009: Prohibit reuse publications when it has already existed. Per comments #11 and 12 [1].
> V22-0010: Avoid to call PQclear()/PQfinish()/pg_free() if the process exits soon. Per comment #15 [1].
> V22-0011: Update testcode. Per comments #17- [1].
>
> I did not handle below points because I have unclear points.
>
> a.
> This patch set cannot detect the disconnection between the target (standby) and
> source (primary) during the catch up. Because the connection status must be gotten
> at the same time (=in the same query) with the recovery status, but now it is now an
> independed function (server_is_in_recovery()).
>
> b.
> This patch set cannot detect the inconsistency reported by Shubham [3]. I could not
> come up with solutions without removing -P...

Few comments:
1) The below code can lead to assertion failure if the publisher is
stopped while dropping the replication slot:
+ if (primary_slot_name != NULL)
+ {
+ conn = connect_database(dbinfo[0].pubconninfo);
+ if (conn != NULL)
+ {
+ drop_replication_slot(conn, &dbinfo[0],
primary_slot_name);
+ }
+ else
+ {
+ pg_log_warning("could not drop replication
slot \"%s\" on primary",
+ primary_slot_name);
+ pg_log_warning_hint("Drop this replication
slot soon to avoid retention of WAL files.");
+ }
+ disconnect_database(conn);
+ }

pg_createsubscriber: error: connection to database failed: connection
to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or
directory
Is the server running locally and accepting connections on that socket?
pg_createsubscriber: warning: could not drop replication slot
"standby_1" on primary
pg_createsubscriber: hint: Drop this replication slot soon to avoid
retention of WAL files.
pg_createsubscriber: pg_createsubscriber.c:432: disconnect_database:
Assertion `conn != ((void *)0)' failed.
Aborted (core dumped)

This is happening because we are calling disconnect_database in case
of connection failure case too which has the following assert:
+static void
+disconnect_database(PGconn *conn)
+{
+ Assert(conn != NULL);
+
+ PQfinish(conn);
+}

2) There is a CheckDataVersion function which does exactly this, will
we be able to use this:
+ /* Check standby server version */
+ if ((ver_fd = fopen(versionfile, "r")) == NULL)
+ pg_fatal("could not open file \"%s\" for reading: %m",
versionfile);
+
+ /* Version number has to be the first line read */
+ if (!fgets(rawline, sizeof(rawline), ver_fd))
+ {
+ if (!ferror(ver_fd))
+ pg_fatal("unexpected empty file \"%s\"", versionfile);
+ else
+ pg_fatal("could not read file \"%s\": %m", versionfile);
+ }
+
+ /* Strip trailing newline and carriage return */
+ (void) pg_strip_crlf(rawline);
+
+ if (strcmp(rawline, PG_MAJORVERSION) != 0)
+ {
+ pg_log_error("standby server is of wrong version");
+ pg_log_error_detail("File \"%s\" contains \"%s\",
which is not compatible with this program's version \"%s\".",
+ versionfile,
rawline, PG_MAJORVERSION);
+ exit(1);
+ }
+
+ fclose(ver_fd);

3) Should this be added to typedefs.list:
+enum WaitPMResult
+{
+ POSTMASTER_READY,
+ POSTMASTER_STILL_STARTING
+};

4) pgCreateSubscriber should be mentioned after pg_controldata to keep
the ordering consistency:
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index aa94f6adf6..c5edd244ef 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -285,6 +285,7 @@
&pgCtl;
&pgResetwal;
&pgRewind;
+ &pgCreateSubscriber;
&pgtestfsync;

5) Here pg_replication_slots should be pg_catalog.pg_replication_slots:
+ if (primary_slot_name)
+ {
+ appendPQExpBuffer(str,
+ "SELECT 1 FROM
pg_replication_slots "
+ "WHERE active AND
slot_name = '%s'",
+ primary_slot_name);

6) Here pg_settings should be pg_catalog.pg_settings:
+ * - max_worker_processes >= 1 + number of dbs to be converted
+ *------------------------------------------------------------------------
+ */
+ res = PQexec(conn,
+ "SELECT setting FROM pg_settings
WHERE name IN ("
+ "'max_logical_replication_workers', "
+ "'max_replication_slots', "
+ "'max_worker_processes', "
+ "'primary_slot_name') "
+ "ORDER BY name");

Regards,
Vignesh

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2024-02-20 13:46:16 Re: numeric_big in make check?
Previous Message vignesh C 2024-02-20 13:26:42 Re: confirmed flush lsn seems to be move backward in certain error cases