subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Subject: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo
Date: 2024-01-17 23:15:28
Message-ID: CAHut+PuEsekA3e7ThwzWr+Us4x=LzkF7DSrED1UsZTUqNrhCUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I had reported a possible subscription 'disable_on_error' bug found
while reviewing another patch.

I am including my initial report and Nisha's analysis again here so
that this topic has its own thread.

==================
INITIAL REPORT [1]
==================

...
I see now that any ALTER of the subscription's connection, even to
some value that fails, will restart a new worker (like ALTER of any
other subscription parameters). For a bad connection, it will continue
to relaunch-worker/ERROR over and over. e.g.

----------
test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG: logical
replication apply worker for subscription "sub4" has started
2024-01-17 09:34:28.666 AEDT [11274] ERROR: could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:28.667 AEDT [928] LOG: background worker "logical
replication apply worker" (PID 11274) exited with exit code 1
2024-01-17 09:34:33.669 AEDT [11391] LOG: logical replication apply
worker for subscription "sub4" has started
2024-01-17 09:34:33.669 AEDT [11391] ERROR: could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:33.670 AEDT [928] LOG: background worker "logical
replication apply worker" (PID 11391) exited with exit code 1
etc...
----------

While experimenting with the bad connection ALTER I also tried setting
'disable_on_error' like below:

ALTER SUBSCRIPTION sub4 SET (disable_on_error);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';

...but here the subscription did not become DISABLED as I expected it
would do on the next connection error iteration. It remains enabled
and just continues to loop relaunch/ERROR indefinitely same as before.

That looks like it may be a bug. Thoughts?

=====================
ANALYSIS BY NISHA [2]
=====================

Ideally, if the already running apply worker in
"LogicalRepApplyLoop()" has any exception/error it will be handled and
the subscription will be disabled if 'disable_on_error' is set -

start_apply(XLogRecPtr origin_startpos)
{
PG_TRY();
{
LogicalRepApplyLoop(origin_startpos);
}
PG_CATCH();
{
if (MySubscription->disableonerr)
DisableSubscriptionAndExit();
...

What is happening in this case is that the control reaches the
function - run_apply_worker() -> start_apply() -> LogicalRepApplyLoop
-> maybe_reread_subscription()
...
/*
* Exit if any parameter that affects the remote connection was changed.
* The launcher will start a new worker but note that the parallel apply
* worker won't restart if the streaming option's value is changed from
* 'parallel' to any other value or the server decides not to stream the
* in-progress transaction.
*/
if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
...

and it sees a change in the parameter and calls apply_worker_exit().
This will exit the current process, without throwing an exception to
the caller and the postmaster will try to restart the apply worker.
The new apply worker, before reaching the start_apply() [where we
handle exception], will hit the code to establish the connection to
the publisher -

ApplyWorkerMain() -> run_apply_worker() -
...
LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo,
true /* replication */ ,
true,
must_use_password,
MySubscription->name, &err);

if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
errmsg("could not connect to the publisher: %s", err)));
...

and due to the bad connection string in the subscription, it will error out.
[28680] ERROR: could not connect to the publisher: invalid port number: "-1"
[3196] LOG: background worker "logical replication apply worker" (PID
28680) exited with exit code 1

Now, the postmaster keeps trying to restart the apply worker and it
will keep failing until the connection string is corrected or the
subscription is disabled manually.

I think this is a bug that needs to be handled in run_apply_worker()
when disable_on_error is set.

======
[1] https://www.postgresql.org/message-id/CAHut%2BPvcf7P2dHbCeWPM4jQ%3DyHqf4WFS_C6eVb8V%3DbcZPMMp7A%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CABdArM6ORu%2BKpS_kXd-jwwPBqYPo1YqZjwwGnqmVanWgdHCggA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2024-01-17 23:24:41 Re: Improve the connection failure error messages
Previous Message Peter Geoghegan 2024-01-17 23:08:51 Re: Emit fewer vacuum records by reaping removable tuples during pruning