From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
Cc: | Euler Taveira <euler(at)eulerto(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Euler Taveira <euler(dot)taveira(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org> |
Subject: | Re: State of pg_createsubscriber |
Date: | 2024-05-22 11:42:40 |
Message-ID: | CAA4eK1KcprYdxWwJMoX7HvXcsuPV4ZHUKstdRY0NEOx=VtEJTA@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, May 22, 2024 at 2:45 PM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>
> > Just to summarize, apart from BF failures for which we had some
> > discussion, I could recall the following open points:
> >
> > 1. After promotion, the pre-existing replication objects should be
> > removed (either optionally or always), otherwise, it can lead to a new
> > subscriber not being able to restart or getting some unwarranted data.
> > [1][2].
> >
> I tried to reproduce the case and found a case where pre-existing
> replication objects can cause unwanted scenario:
>
> Suppose we have a setup of nodes N1, N2 and N3.
> N1 and N2 are in streaming replication where N1 is primary and N2 is standby.
> N3 and N1 are in logical replication where N3 is publisher and N1 is subscriber.
> The subscription created on N1 is replicated to N2 due to streaming replication.
>
> Now, after we run pg_createsubscriber on N2 and start the N2 server,
> we get the following logs repetitively:
> 2024-05-22 11:37:18.619 IST [27344] ERROR: could not start WAL
> streaming: ERROR: replication slot "test1" is active for PID 27202
> 2024-05-22 11:37:18.622 IST [27317] LOG: background worker "logical
> replication apply worker" (PID 27344) exited with exit code 1
> 2024-05-22 11:37:23.610 IST [27349] LOG: logical replication apply
> worker for subscription "test1" has started
> 2024-05-22 11:37:23.624 IST [27349] ERROR: could not start WAL
> streaming: ERROR: replication slot "test1" is active for PID 27202
> 2024-05-22 11:37:23.627 IST [27317] LOG: background worker "logical
> replication apply worker" (PID 27349) exited with exit code 1
> 2024-05-22 11:37:28.616 IST [27382] LOG: logical replication apply
> worker for subscription "test1" has started
>
> Note: 'test1' is the name of the subscription created on N1 initially
> and by default, slot name is the same as the subscription name.
>
> Once the N2 server is started after running pg_createsubscriber, the
> subscription that was earlier replicated by streaming replication will
> now try to connect to the publisher. Since the subscription name in N2
> is the same as the subscription created in N1, it will not be able to
> start a replication slot as the slot with the same name is active for
> logical replication between N3 and N1.
>
> Also, there would be a case where N1 becomes down for some time. Then
> in that case subscription on N2 will connect to the publication on N3
> and now data from N3 will be replicated to N2 instead of N1. And once
> N1 is up again, subscription on N1 will not be able to connect to
> publication on N3 as it is already connected to N2. This can lead to
> data inconsistency.
>
So, what shall we do about such cases? I think we can either remove
all pre-existing subscriptions and publications on the promoted
standby by default, or remove them only when some switch is given. If
we want to go with this idea, then we might need to distinguish between
pre-existing subscriptions and the ones created by this tool.
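
As an illustration only, here is a minimal Python sketch of how such a cleanup could tell the two kinds of subscriptions apart and build the SQL to remove the pre-existing ones. It is not how pg_createsubscriber works: the helper names `quote_ident` and `cleanup_statements` are hypothetical, and it assumes the caller already knows which subscription names the tool created. Each subscription is first disabled and detached from its slot (`slot_name = NONE`), so that `DROP SUBSCRIPTION` does not try to drop the remote slot that N1 is still using on N3.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def cleanup_statements(all_subs, tool_subs):
    """Build SQL that drops every subscription not created by the tool.

    all_subs  -- names of all subscriptions found on the promoted standby
    tool_subs -- names the tool itself created (assumed known to the caller)
    """
    keep = set(tool_subs)
    stmts = []
    for name in all_subs:
        if name in keep:
            continue  # created by the tool, leave it alone
        ident = quote_ident(name)
        # Disable first; slot_name can only be set to NONE while disabled.
        stmts.append(f"ALTER SUBSCRIPTION {ident} DISABLE;")
        # Detach from the slot so the drop leaves the publisher's slot intact.
        stmts.append(f"ALTER SUBSCRIPTION {ident} SET (slot_name = NONE);")
        stmts.append(f"DROP SUBSCRIPTION {ident};")
    return stmts


if __name__ == "__main__":
    # 'test1' is the replicated subscription from the scenario above;
    # 'new_sub' stands in for a subscription created by the tool.
    for stmt in cleanup_statements(["test1", "new_sub"], ["new_sub"]):
        print(stmt)
```

The statements would have to be run in each database of the promoted standby before any apply worker gets a chance to connect to the old publisher.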
The other open point I remember is adding an option to this tool so
that users can avoid specifying slots, pubs, etc. for each database.
See [1]. We can probably leave it out if it is not important, but we
never reached a conclusion on it.
--
With Regards,
Amit Kapila.