RE: Test failures of 100_bugs.pl

From: "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Test failures of 100_bugs.pl
Date: 2023-04-21 05:47:57
Message-ID: OSZPR01MB6310D6F48372F52F1D85E1C5FD609@OSZPR01MB6310.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 24, 2023 7:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Jan 24, 2023 at 8:53 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > cfbot, the buildfarm and locally I have seen 100_bugs.pl fail
> > occasionally. Just rarely enough that I never got around to looking into it
> > for real.
> >
> ...
> >
> > We see t2 added to the publication:
> > 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0]
> LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2
> >
> > And that *then* "t" was synchronized:
> > 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: logical
> replication table synchronization worker for subscription "testsub", table "t" has
> finished
> >
> > and then that the refresh was issued:
> > 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0]
> LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION
> >
> > And we see a walsender starting and the query to get the new tables being
> executed:
> > 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG:
> statement: SELECT DISTINCT t.schemaname, t.tablename
> > , t.attnames
> > FROM pg_catalog.pg_publication_tables t
> > WHERE t.pubname IN ('testpub')
> >
> >
> > And that's it, the rest of the time is just polling.
> >
> >
> > Perhaps wait_for_subscription_sync() should dump the set of rel states to
> make
> > something like this more debuggable?
> >
> >
> > The fact that the synchronization for t finished just before the refresh makes
> > me wonder if a wakeup or a cache invalidation got lost?
> >
>
> From the LOGs, the only thing one could draw is lost invalidation
> because the nap time of the apply worker is 1s, so it should process
> invalidation during the time we are polling. Also, the rel should be
> added to pg_subscription_rel because the test is still polling for
> rels to be in 'ready' or 'done' state.
>
> I think we can do three things to debug (a) as you suggest dump the
> rel state in wait_for_subscription_sync; (b) add some DEBUG log in
> invalidate_syncing_table_states() to ensure that invalidation has been
> processed; (c) print rel states and relids from table_states_not_ready
> in process_syncing_tables_for_apply() to see if t2 has ever appeared
> in that list.
>

There was a similar buildfarm failure on francolin recently[1]. I think the
problem is that after REFRESH PUBLICATION, the table sync worker for the new
table test_mismatching_types was not started. So, the test timed out while
waiting for an ERROR message that should have been reported by the table sync
worker.

--
regress_log_014_binary:
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at /home/bf/bf-build/francolin/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.

014_binary_subscriber.log:
2023-04-16 18:18:38.455 UTC [3079482] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-04-16 18:21:39.219 UTC [3079474] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
--

I wrote a patch to dump rel state in wait_for_subscription_sync() as suggested.
Please see the attached patch.
I will try to add some debug logs in code later.

[1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2023-04-16%2018%3A17%3A09

Regards,
Shi Yu

Attachment Content-Type Size
v1-0001-dump-rel-state-in-wait_for_subscription_sync.patch application/octet-stream 1.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-04-21 05:57:58 Re: Support logical replication of DDLs
Previous Message Gurjeet Singh 2023-04-21 05:29:35 Minor code de-duplication in fe-connect.c