From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com> |
Cc: | Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "Smith, Peter" <peters(at)fast(dot)au(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Optionally automatically disable logical replication subscriptions on error |
Date: | 2022-03-04 06:55:00 |
Message-ID: | CAD21AoBMxAcbjwrE5Xa9AziXLVMODL4taAGhdEUw+-5AdzDTuA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Mar 2, 2022 at 6:38 PM osumi(dot)takamichi(at)fujitsu(dot)com
<osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
>
> On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > After more thoughts, should we do both AbortOutOfAnyTransaction() and error
> > message handling while holding interrupts? That is,
> >
> > HOLD_INTERRUPTS();
> > EmitErrorReport();
> > FlushErrorState();
> > AbortOutOfAny Transaction();
> > RESUME_INTERRUPTS();
> > pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> > er());
> >
> > I think it's better that we do clean up first and then do other works such as
> > sending the message to the stats collector and updating the catalog.
> I agree. Fixed. Along with this change, I corrected the header comment of
> DisableSubscriptionOnError, too.
>
>
> > Here are some comments on v24 patch:
> >
> > + /* Look up our subscription in the catalogs */
> > + tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
> > +
> > CStringGetDatum(MySubscription->name));
> >
> > s/catalogs/catalog/
> >
> > Why don't we use SUBSCRIPTIONOID with MySubscription->oid?
> Changed.
>
>
> > ---
> > + if (!HeapTupleIsValid(tup))
> > + ereport(ERROR,
> > + errcode(ERRCODE_UNDEFINED_OBJECT),
> > + errmsg("subscription \"%s\" does not
> > exist",
> > + MySubscription->name));
> >
> > I think we should use elog() here rather than ereport() since it's a
> > should-not-happen error.
> Fixed
>
>
> > ---
> > + /* Notify the subscription will be no longer valid */
> >
> > I'd suggest rephrasing it to like "Notify the subscription will be disabled". (the
> > subscription is still valid actually, but just disabled).
> Fixed. Also, I've made this sentence past one, because of the code place
> change below.
>
>
> > ---
> > + /* Notify the subscription will be no longer valid */
> > + ereport(LOG,
> > + errmsg("logical replication subscription
> > \"%s\" will be disabled due to an error",
> > + MySubscription->name));
> > +
> >
> > I think we can report the log at the end of this function rather than during the
> > transaction.
> Fixed. In this case, I needed to adjust the comment to indicate the processing
> to disable the sub has *completed* as well.
>
> > ---
> > +my $cmd = qq(
> > +CREATE TABLE tbl (i INT);
> > +ALTER TABLE tbl REPLICA IDENTITY FULL;
> > +CREATE INDEX tbl_idx ON tbl(i));
> >
> > I think we don't need to set REPLICA IDENTITY FULL to this table since there is
> > notupdate/delete.
> >
> > Do we need tbl_idx?
> Removed both the replica identity and tbl_idx;
>
>
> > ---
> > +$cmd = qq(
> > +SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE
> > +sr.srsubstate IN ('s', 'r'));
> > +$node_subscriber->poll_query_until('postgres', $cmd);
> >
> > It seems better to add a condition of srrelid just in case.
> Makes sense. Fixed.
>
>
> > ---
> > +$cmd = qq(
> > +SELECT count(1) = 1 FROM pg_catalog.pg_subscription s WHERE
> > s.subname =
> > +'sub' AND s.subenabled IS FALSE);
> > +$node_subscriber->poll_query_until('postgres', $cmd)
> > + or die "Timed out while waiting for subscriber to be disabled";
> >
> > I think that it's more natural to directly check the subscription's subenabled.
> > For example:
> >
> > SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';
> Fixed. I modified another code similar to this for tablesync error as well.
>
>
> > ---
> > +$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
> > +$node_subscriber->safe_psql('postgres', $cmd); $cmd = q(SELECT
> > COUNT(1)
> > += 3 FROM tbl WHERE i = 3);
> > +$node_subscriber->poll_query_until('postgres', $cmd)
> > + or die "Timed out while waiting for applying";
> >
> > I think it's better to wait for the subscriber to catch up and check the query
> > result instead of using poll_query_until() so that we can check the query result
> > in case where the test fails.
> I modified the code to wait for the subscriber and deleted poll_query_until.
> Also, when I consider the test failure for this test as you mentioned,
> expecting and checking the number of return value that equals 3
> would be better. So, I expressed this point in my test as well,
> according to your advice.
>
>
> > ---
> > +$cmd = qq(DROP INDEX tbl_unique);
> > +$node_subscriber->safe_psql('postgres', $cmd);
> >
> > In the newly added tap tests, all queries are consistently assigned to $cmd and
> > executed even when the query is used only once. It seems a different style from
> > the one in other tap tests. Is there any reason why we do this style for all queries
> > in the tap tests?
> Fixed. I removed the 'cmd' variable itself.
>
>
> Attached an updated patch v26.
Thank you for updating the patch.
Here are some comments on v26 patch:
+/*
+ * Disable the current subscription.
+ */
+static void
+DisableSubscriptionOnError(void)
This function now just updates the pg_subscription catalog so can we
move it to pg_subscritpion.c while having this function accept the
subscription OID to disable? If you agree, the function comment will
also need to be updated.
---
+ /*
+ * First, ensure that we log the error message so
that it won't be
+ * lost if some other internal error occurs in the
following code.
+ * Then, abort the current transaction and send the
stats message of
+ * the table synchronization failure in an idle state.
+ */
+ HOLD_INTERRUPTS();
+ EmitErrorReport();
+ FlushErrorState();
+ AbortOutOfAnyTransaction();
+ RESUME_INTERRUPTS();
+ pgstat_report_subscription_error(MySubscription->oid, false);
+
+ if (MySubscription->disableonerr)
+ {
+ DisableSubscriptionOnError();
+ proc_exit(0);
+ }
+
+ PG_RE_THROW();
If the disableonerr is false, the same error is reported twice. Also,
the code in PG_CATCH() in both start_apply() and start_table_sync()
are almost the same. Can we create a common function to do post-error
processing?
The worker should exit with return code 1.
I've attached a fixup patch for changes to worker.c for your
reference. Feel free to adopt the changes.
---
+
+# Confirm that we have finished the table sync.
+is( $node_subscriber->safe_psql(
+ 'postgres', qq(SELECT MAX(i), COUNT(*) FROM tbl)),
+ "1|3",
+ "subscription sub replicated data");
+
Can we store the result to a local variable to check? I think it's
more consistent with other tap tests.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment | Content-Type | Size |
---|---|---|
0001-fixup-Optionally-disable-subscriptions-on-error.patch | application/octet-stream | 8.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2022-03-04 07:19:32 | Re: wal_compression=zstd |
Previous Message | Michael Paquier | 2022-03-04 06:44:22 | pg_tablespace_location() failure with allow_in_place_tablespaces |