Re: BF member drongo doesn't like 035_standby_logical_decoding.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Subject: Re: BF member drongo doesn't like 035_standby_logical_decoding.pl
Date: 2025-01-24 19:23:26
Message-ID: dfptqhj4cpmfeqkgmaeot4pl5dgetpxul4pyhyed3pdiox2ctr@yotyy2jdmioo
Lists: pgsql-hackers

Hi,

Thanks for noticing the pattern.

On 2025-01-24 11:42:15 -0500, Tom Lane wrote:
> I realized just now that drongo has been intermittently failing like this:
>
> 147/256 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 2116.16s (exit status 255 or signal 127 SIGinvalid)
> ------------------------------------- 8< -------------------------------------
> stderr:
> # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> # at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
> # poll_query_until timed out executing this query:
> # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> # Failed test 'confl_active_logicalslot updated'
> # at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 255 just after 24.
>

There's a failure before this one:
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class

> This has been happening for some time, in all three branches where
> that test script exists. The oldest failure that looks like that in
> the v16 branch is
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35

Hm.

2024-09-06 05:23:29.031 UTC [1612:14] 035_standby_logical_decoding.pl LOG: logical decoding found consistent point at 0/4000180
2024-09-06 05:23:29.031 UTC [1612:15] 035_standby_logical_decoding.pl DETAIL: There are no running transactions.
2024-09-06 05:23:29.031 UTC [1612:16] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
...
2024-09-06 05:23:49.601 UTC [3984:7] LOG: invalidating obsolete replication slot "row_removal_inactiveslot"
2024-09-06 05:23:49.601 UTC [3984:8] DETAIL: The slot conflicted with xid horizon 747.
2024-09-06 05:23:49.601 UTC [3984:9] CONTEXT: WAL redo at 0/4020210 for Heap2/PRUNE: snapshotConflictHorizon: 747, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
...
2024-09-06 05:28:21.043 UTC [5648:6] LOG: restartpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.191 s, sync=0.001 s, total=3.192 s; sync files=0, longest=0.000 s, average=0.000 s; distance=249 kB, estimate=29516 kB; lsn=0/403E6B8, redo lsn=0/403E680
2024-09-06 05:28:21.043 UTC [5648:7] LOG: recovery restart point at 0/403E680
...
2024-09-06 05:54:38.414 UTC [3984:10] LOG: invalid record length at 0/403E768: expected at least 24, got 0

There is no further message from 1612 and no intervening message by 3984
(startup) between invalidating row_removal_inactiveslot and reaching the end
of the WAL.

The regress_log does have:

Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done

I.e. this looks not to be a failure of the test to find the right message, or
of the stats about the termination not being flushed, but rather that
row_removal_activeslot isn't getting terminated. And due to the replay_lsn
check on the primary, we know that replication must have progressed past the
point of the conflict. So the problem doesn't appear to be that we're waiting
for termination, but that somehow we determined that termination is not
necessary.
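
For context, the regress_log line above comes from the test's catchup wait,
which is roughly the following (a sketch using the usual
PostgreSQL::Test::Cluster helpers; not the exact call site in
035_standby_logical_decoding.pl):

```perl
# Sketch of the catchup wait behind "Waiting for replication conn standby's
# replay_lsn to pass ..."; names follow PostgreSQL::Test::Cluster.
my $primary_lsn = $node_primary->safe_psql('postgres',
	"SELECT pg_current_wal_lsn()");
# Blocks until the standby's replay_lsn passes $primary_lsn, i.e. until
# replay has progressed past the conflict-generating WAL records.
$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
```

Because that wait completed ("done"), the standby must already have replayed
the PRUNE record that should have triggered the conflict.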

Unfortunately with the logs available I can't really pinpoint why. I think we
need to
a) increase the postgres log level for the standby
b) add some explicit debug logging to report the contents of
pg_replication_slots
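
Something along these lines, perhaps (a sketch only; the chosen log level and
column list are assumptions, not a concrete proposal):

```perl
# a) bump the standby's log level; log_min_messages is reloadable
$node_standby->append_conf('postgresql.conf', "log_min_messages = debug2");
$node_standby->reload;

# b) dump slot state around the invalidation check, so a failure
#    shows whether the slot was ever marked as conflicting
my $slots = $node_standby->safe_psql('testdb',
	"SELECT slot_name, active, active_pid, conflicting"
	. " FROM pg_replication_slots");
note "pg_replication_slots on standby:\n$slots";
```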

I hope the issue is that somehow we didn't need to terminate, rather than
missing the need to terminate and just continuing on with replay. But I can't
think of a scenario in which there wouldn't be a need to terminate...

> However, there are older failures showing a timeout of
> 035_standby_logical_decoding.pl that don't provide any detail, but
> might well be the same thing. The oldest one of those is from
> 2024-05-01, which is still considerably later than the test script
> itself (added on 2023-04-08). So it would seem that this is something
> we broke during 2024, rather than an aboriginal problem in this test.

Looking at the failures I see on drongo, I also started to wonder if there's
something broken around recovery conflict handling. Perhaps something
Windows-specific?

There are, e.g., the following:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-12-26%2005%3A49%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-10-14%2016%3A39%3A49

Which fail like this:
------------------------------------- 8< -------------------------------------
stderr:
# poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
# Failed test 'startup deadlock: lock acquisition is waiting'
# at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 261.

Greetings,

Andres Freund
