Re: State of pg_createsubscriber

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Euler Taveira <euler(at)eulerto(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Euler Taveira <euler(dot)taveira(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: State of pg_createsubscriber
Date: 2024-05-22 09:15:19
Message-ID: CANhcyEWvimA1-f6hSrA=9qkfR5SonFb56b36M++vT=LiFj=76g@mail.gmail.com
Lists: pgsql-hackers

> Just to summarize, apart from BF failures for which we had some
> discussion, I could recall the following open points:
>
> 1. After promotion, the pre-existing replication objects should be
> removed (either optionally or always), otherwise, it can lead to a new
> subscriber not being able to restart or getting some unwarranted data.
> [1][2].
>
I tried to reproduce this and found a case where pre-existing
replication objects cause an unwanted scenario:

Suppose we have a setup of nodes N1, N2 and N3.
N1 and N2 are in streaming replication, where N1 is the primary and N2 is the standby.
N3 and N1 are in logical replication, where N3 is the publisher and N1 is the subscriber.
The subscription created on N1 is replicated to N2 by streaming replication.

Now, after we run pg_createsubscriber on N2 and start the N2 server,
we get the following log messages repeatedly:
2024-05-22 11:37:18.619 IST [27344] ERROR: could not start WAL streaming: ERROR: replication slot "test1" is active for PID 27202
2024-05-22 11:37:18.622 IST [27317] LOG: background worker "logical replication apply worker" (PID 27344) exited with exit code 1
2024-05-22 11:37:23.610 IST [27349] LOG: logical replication apply worker for subscription "test1" has started
2024-05-22 11:37:23.624 IST [27349] ERROR: could not start WAL streaming: ERROR: replication slot "test1" is active for PID 27202
2024-05-22 11:37:23.627 IST [27317] LOG: background worker "logical replication apply worker" (PID 27349) exited with exit code 1
2024-05-22 11:37:28.616 IST [27382] LOG: logical replication apply worker for subscription "test1" has started

Note: 'test1' is the name of the subscription created on N1 initially;
by default, the slot name is the same as the subscription name.

Once the N2 server is started after running pg_createsubscriber, the
subscription that was earlier copied to it by streaming replication
tries to connect to the publisher N3. Since it has the same name, and
therefore the same slot name, as the subscription on N1, it cannot
start streaming: the slot "test1" on N3 is already active for the
logical replication between N3 and N1.

There is also the case where N1 goes down for some time. The
subscription on N2 will then connect to the publication on N3, and
data from N3 will be replicated to N2 instead of N1. Once N1 is up
again, the subscription on N1 will not be able to connect to N3,
because the slot is now in use by N2. This can lead to data
inconsistency.
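
To make the two situations above concrete, here is a toy Python model
of the slot contention. It is only a sketch, not PostgreSQL code: the
Publisher class and the try_connect helper are hypothetical names made
up for illustration, and the error text is simplified (the real message
reports a PID, not a node name). The only behaviour it assumes is the
one visible in the log above, namely that a slot can be streamed from
by one connection at a time.

```python
class Publisher:
    """Toy stand-in for N3: a slot can be held by one subscriber at a time."""

    def __init__(self):
        # slot name -> node currently streaming from that slot
        self.active = {}

    def start_streaming(self, slot, node):
        holder = self.active.get(slot)
        if holder is not None and holder != node:
            # Simplified version of the error seen in the N2 log.
            raise RuntimeError(f'replication slot "{slot}" is active for {holder}')
        self.active[slot] = node

    def disconnect(self, slot, node):
        # Release the slot only if this node is the one holding it.
        if self.active.get(slot) == node:
            del self.active[slot]


def try_connect(pub, slot, node):
    """Attempt to stream from the slot and report the outcome as a string."""
    try:
        pub.start_streaming(slot, node)
        return "streaming"
    except RuntimeError as exc:
        return f"ERROR: {exc}"


n3 = Publisher()
results = []

# N1 is the original subscriber and holds slot "test1".
results.append(("N1", try_connect(n3, "test1", "N1")))
# N2 inherited the same subscription; its apply worker is rejected.
results.append(("N2", try_connect(n3, "test1", "N2")))
# N1 goes down, so the slot becomes free.
n3.disconnect("test1", "N1")
# The next retry from N2 succeeds and N2 now receives N3's changes.
results.append(("N2", try_connect(n3, "test1", "N2")))
# N1 comes back up and is now the one locked out.
results.append(("N1", try_connect(n3, "test1", "N1")))

for node, outcome in results:
    print(node, outcome)
```

The second step corresponds to the repeated ERROR in the log, and the
last two steps correspond to the N1-down case: whichever node grabs
the slot first keeps it, so the changes end up split between N1 and N2.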

This error did not happen before running pg_createsubscriber on the
standby node N2, because there is no 'logical replication launcher'
process on a standby node.

Thanks and Regards,
Shlok Kyal
