From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Subject: Re: BF member drongo doesn't like 035_standby_logical_decoding.pl
Date: 2025-01-24 19:44:21
Message-ID: j3dy36nv37uieq54xm2zoi5qzi7gfzohuf24vr5nqtusfxd2dj@xbuc2so4rrfx
Lists: pgsql-hackers
Hi,
On 2025-01-24 14:23:26 -0500, Andres Freund wrote:
> On 2025-01-24 11:42:15 -0500, Tom Lane wrote:
> There's a failure before this one:
> [05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class
>
>
> > This has been happening for some time, in all three branches where
> > that test script exists. The oldest failure that looks like that in
> > the v16 branch is
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
>
> Hm.
>
> 2024-09-06 05:23:29.031 UTC [1612:14] 035_standby_logical_decoding.pl LOG: logical decoding found consistent point at 0/4000180
> 2024-09-06 05:23:29.031 UTC [1612:15] 035_standby_logical_decoding.pl DETAIL: There are no running transactions.
> 2024-09-06 05:23:29.031 UTC [1612:16] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
> ...
> 2024-09-06 05:23:49.601 UTC [3984:7] LOG: invalidating obsolete replication slot "row_removal_inactiveslot"
> 2024-09-06 05:23:49.601 UTC [3984:8] DETAIL: The slot conflicted with xid horizon 747.
> 2024-09-06 05:23:49.601 UTC [3984:9] CONTEXT: WAL redo at 0/4020210 for Heap2/PRUNE: snapshotConflictHorizon: 747, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
> ...
> 2024-09-06 05:28:21.043 UTC [5648:6] LOG: restartpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.191 s, sync=0.001 s, total=3.192 s; sync files=0, longest=0.000 s, average=0.000 s; distance=249 kB, estimate=29516 kB; lsn=0/403E6B8, redo lsn=0/403E680
> 2024-09-06 05:28:21.043 UTC [5648:7] LOG: recovery restart point at 0/403E680
> ...
> 2024-09-06 05:54:38.414 UTC [3984:10] LOG: invalid record length at 0/403E768: expected at least 24, got 0
>
> There is no further message from 1612 and no intervening message by 3984
> (startup) between invalidating row_removal_inactiveslot and reaching the end
> of the WAL.
>
> The regress_log does have:
>
> Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
> done
>
>
> I.e. this looks to not be a failure of the test to find the right message or
> the stats about the termination not being flushed, but that
> row_removal_activeslot isn't getting terminated. And due to the replay_lsn
> check on the primary, we know that replication must have progressed past the
> point of the conflict. So the problem doesn't appear to be that we're waiting
> for termination, but that somehow we determined that termination is not
> necessary.
>
> Unfortunately with the logs available I can't really pinpoint why. I think we
> need to
> a) increase the postgres log level for the standby
> b) add some explicit debug logging to report the contents of
> pg_replication_slots
>
> I hope the issue is that somehow we didn't need to terminate, rather than
> missing the need to terminate and just continuing on with replay. But I can't
> think of a scenario in which there wouldn't be a need to terminate...
<thinks>
Hm, maybe I'm missing something, but isn't it possible for the active slot to
actually progress decoding past the conflict point? It's an active slot, with
the consumer running in the background, so all that needs to happen for that
is that logical decoding progresses past the conflict point. That requires
some reference to a newer xid to be in the WAL, but there's nothing
preventing that, afaict?
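To check that, the debug logging mentioned above could just dump the slot
state on the standby around the conflict point; if the active slot's
catalog_xmin has already advanced past the conflict horizon (747 in the log
above), no termination would presumably be needed. A minimal sketch, using
the test's database name and just the columns that seem useful:

    # dump slot state on the standby so the regress_log shows whether
    # row_removal_activeslot's catalog_xmin already passed the conflict horizon
    note($node_standby->safe_psql('testdb',
        qq[SELECT slot_name, active_pid, xmin, catalog_xmin, restart_lsn
           FROM pg_replication_slots]));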
In fact, I now saw this comment:
# Note that pg_current_snapshot() is used to get the horizon. It does
# not generate a Transaction/COMMIT WAL record, decreasing the risk of
# seeing a xl_running_xacts that would advance an active replication slot's
# catalog_xmin. Advancing the active replication slot's catalog_xmin
# would break some tests that expect the active slot to conflict with
# the catalog xmin horizon.
Which seems precisely what's happening here?
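Roughly, the distinction that comment is drawing is the difference between
something like the following two probes on the primary (a sketch only, not
the test's actual code):

    # read-only: no xid is assigned, so no Transaction/COMMIT record is written
    $node_primary->safe_psql('testdb', 'SELECT pg_current_snapshot()');

    # whereas this assigns an xid and writes a commit record, increasing the
    # risk of an xl_running_xacts that advances the active slot's catalog_xmin
    $node_primary->safe_psql('testdb', 'SELECT pg_current_xact_id()');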
If that's the issue, I think we need to find a way to block logical decoding
from making forward progress during the test.
The easiest way would be to stop pg_recvlogical and emit a bunch of changes,
so that the backend is stalled sending out data. But that'd require a
hard-to-predict amount of data to be emitted, which isn't great.
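Just to sketch what that would look like (table name hypothetical; the row
count is exactly the hard-to-predict part):

    # with the background pg_recvlogical stopped and no longer draining the
    # stream, push enough changes through the slot's database that the
    # standby's walsender eventually blocks on a full send buffer
    $node_primary->safe_psql('testdb',
        'INSERT INTO stall_tbl SELECT generate_series(1, 100000)');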
But perhaps we could do something smarter, by starting a session on the
primary that acquires an access exclusive lock on a relation that logical
decoding will need to access? The tricky bit likely would be that it'd
somehow need to *not* prevent VACUUM on the primary.
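A rough sketch of the shape, with a hypothetical blocker_tbl standing in for
whatever relation decoding would have to open, and glossing over exactly the
VACUUM problem above:

    # hold an access exclusive lock open on the primary; AELs are WAL-logged
    # and held by the standby's startup process until this transaction ends
    my $blocker = $node_primary->background_psql('testdb');
    $blocker->query_safe('BEGIN; LOCK TABLE blocker_tbl IN ACCESS EXCLUSIVE MODE;');
    # ... run the VACUUM / slot-invalidation part of the test here ...
    $blocker->query_safe('COMMIT;');
    $blocker->quit;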
If we could trigger VACUUM in a transaction on the primary this would be
easy, but we can't.
Greetings,
Andres