From: | "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | RE: Test failures of 100_bugs.pl |
Date: | 2023-04-21 05:47:57 |
Message-ID: | OSZPR01MB6310D6F48372F52F1D85E1C5FD609@OSZPR01MB6310.jpnprd01.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Jan 24, 2023 7:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Jan 24, 2023 at 8:53 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > cfbot, the buildfarm and locally I have seen 100_bugs.pl fail
> > occasionally. Just rarely enough that I never got around to looking into it
> > for real.
> >
> ...
> >
> > We see t2 added to the publication:
> > 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0]
> LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2
> >
> > And that *then* "t" was synchronized:
> > 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: logical
> replication table synchronization worker for subscription "testsub", table "t" has
> finished
> >
> > and then that the refresh was issued:
> > 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0]
> LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION
> >
> > And we see a walsender starting and the query to get the new tables being
> executed:
> > 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG:
> statement: SELECT DISTINCT t.schemaname, t.tablename
> > , t.attnames
> > FROM pg_catalog.pg_publication_tables t
> > WHERE t.pubname IN ('testpub')
> >
> >
> > And that's it, the rest of the time is just polling.
> >
> >
> > Perhaps wait_for_subscription_sync() should dump the set of rel states to
> make
> > something like this more debuggable?
> >
> >
> > The fact that the synchronization for t finished just before the refresh makes
> > me wonder if a wakeup or a cache invalidation got lost?
> >
>
> From the LOGs, the only thing one could draw is lost invalidation
> because the nap time of the apply worker is 1s, so it should process
> invalidation during the time we are polling. Also, the rel should be
> added to pg_subscription_rel because the test is still polling for
> rels to be in 'ready' or 'done' state.
>
> I think we can do three things to debug (a) as you suggest dump the
> rel state in wait_for_subscription_sync; (b) add some DEBUG log in
> invalidate_syncing_table_states() to ensure that invalidation has been
> processed; (c) print rel states and relids from table_states_not_ready
> in process_syncing_tables_for_apply() to see if t2 has ever appeared
> in that list.
>
There was a similar buildfarm failure on francolin recently[1]. I think the
problem is that after REFRESH PUBLICATION, the table sync worker for the new
table test_mismatching_types was not started. So, the test timed out while
waiting for an ERROR message that should have been reported by the table sync
worker.
--
regress_log_014_binary:
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at /home/bf/bf-build/francolin/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.
014_binary_subscriber.log:
2023-04-16 18:18:38.455 UTC [3079482] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-04-16 18:21:39.219 UTC [3079474] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
--
I wrote a patch to dump rel state in wait_for_subscription_sync() as suggested.
Please see the attached patch.
I will try to add some debug logs in code later.
[1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2023-04-16%2018%3A17%3A09
Regards,
Shi Yu
Attachment | Content-Type | Size |
---|---|---|
v1-0001-dump-rel-state-in-wait_for_subscription_sync.patch | application/octet-stream | 1.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2023-04-21 05:57:58 | Re: Support logical replication of DDLs |
Previous Message | Gurjeet Singh | 2023-04-21 05:29:35 | Minor code de-duplication in fe-connect.c |