From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
Cc: | Alexander Lakhin <exclusion(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Euler Taveira <euler(at)eulerto(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com> |
Subject: | Re: speed up a logical replica setup |
Date: | 2024-07-03 06:20:40 |
Message-ID: | CAA4eK1+Qmc34cooSNm2=U6YsySSjZTn2_eD_deDFEAZv+aj-AA@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jul 3, 2024 at 10:42 AM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Based on that, I considered a scenario in which the slot could not be synchronized.
> I felt this was not caused by pg_createsubscriber.
>
> 1. At initial stage, the xmin of the physical slot is 743, and nextXid of the
> primary is also 743.
> 2. Autovacuum worker starts a new transaction. nextXid is incremented to 744.
> 3. A logical replication slot with failover=true is created *before the
> transaction from step 2 is replicated to the standby*.
> 4. While creating the slot, the catalog_xmin must be determined.
> The initial candidate is nextXid (= 744), but the oldest xmin of replication
> slots (=743) is used if it is older than nextXid. So 743 is chosen in this case.
> This operation is done in CreateInitDecodingContext()->GetOldestSafeDecodingTransactionId().
> 5. After that, the transaction from step 2 reaches the standby node and
> updates its nextXid.
> 6. Finally, pg_sync_replication_slots() is run on the standby. It finds a failover
> slot on the primary and tries to create it on the standby. However, the
> catalog_xmin on the primary (743) is older than the nextXid of the standby (744),
> so it skips creating the slot.
>
> To avoid the issue, we can disable autovacuum while testing.
>
Your analysis looks correct to me. The test could fail due to
autovacuum. See the following comment in
040_standby_failover_slots_sync:
# Disable autovacuum to avoid generating xid during stats update as otherwise
# the new XID could then be replicated to standby at some random point making
# slots at primary lag behind standby during slot sync.
$publisher->append_conf('postgresql.conf', 'autovacuum = off');
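
As a rough illustration of the sequence described above, here is a toy Python model of the race. It is not PostgreSQL code: the function names (choose_catalog_xmin, can_sync_slot, simulate) are hypothetical, XID wraparound is ignored, and the sync condition is simplified to the comparison stated in the analysis (the primary's catalog_xmin versus the standby's nextXid).

```python
from typing import Optional


def choose_catalog_xmin(next_xid: int, oldest_slot_xmin: Optional[int]) -> int:
    """Toy version of step 4: start from nextXid, but use the oldest
    xmin held by existing replication slots if that one is older."""
    if oldest_slot_xmin is not None and oldest_slot_xmin < next_xid:
        return oldest_slot_xmin
    return next_xid


def can_sync_slot(primary_catalog_xmin: int, standby_next_xid: int) -> bool:
    """Toy version of step 6: the slot is skipped when the primary's
    catalog_xmin is older than the standby's nextXid (simplified)."""
    return primary_catalog_xmin >= standby_next_xid


def simulate(autovacuum_on_primary: bool) -> bool:
    """Walk through steps 1-6 and return True if the slot gets synced."""
    # Step 1: the physical slot's xmin and the primary's nextXid are both 743.
    physical_slot_xmin = 743
    primary_next_xid = 743

    # Step 2: an autovacuum worker consumes an XID on the primary.
    if autovacuum_on_primary:
        primary_next_xid += 1

    # Steps 3-4: the failover slot is created before the standby has
    # replayed the new XID; its catalog_xmin is chosen on the primary.
    catalog_xmin = choose_catalog_xmin(primary_next_xid, physical_slot_xmin)

    # Step 5: the standby replays the transaction and catches up.
    standby_next_xid = primary_next_xid

    # Step 6: slot synchronization on the standby.
    return can_sync_slot(catalog_xmin, standby_next_xid)


if __name__ == "__main__":
    print("autovacuum on :", "synced" if simulate(True) else "skipped")
    print("autovacuum off:", "synced" if simulate(False) else "skipped")
```

In this model the slot is skipped only when the extra XID is consumed on the primary, which matches the reasoning for disabling autovacuum there.
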
> # Descriptions for attached files
>
> An attached script can be used to reproduce the first failure without pg_createsubscriber.
> It requires modifying the code as in [1].
> The 0003 patch disables autovacuum for node_p and node_s. I think node_p is enough, but I did
> both just in case. This fixes the second failure.
>
Disabling on the primary node should be sufficient. Let's do the
minimum required to stabilize this test.
--
With Regards,
Amit Kapila.