From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com> |
Cc: | "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed |
Date: | 2024-01-10 14:00:01 |
Message-ID: | 7c025095-5763-17a6-8c80-899b35bd0459@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
10.01.2024 07:26, Michael Paquier wrote:
> On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
>> Thus just adding FREEZE is not enough, seemingly. It makes me wonder if
>> 0174c2d21 should be superseded by a patch like discussed (or just have
>> autovacuum = off added)...
> Adding an extra FREEZE offers an extra insurance, so I don't see why
> it would be a problem to add it to stabilize the horizon conflicts on
> the standbys.
As 0174c2d21 added FREEZE already, I meant to add "autovacuum = off" or
apply a fix similar to what we're are discussing here.
>
>> Michael, it definitely increases stability of the test (tens of iterations
>> with 20 tests in parallel performed successfully), although I've managed to
>> see another interesting failure (twice):
>> 13 # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
>> 13 # at t/035_standby_logical_decoding.pl line 227.
> Something I'd like to confirm here: you still see this failure with
> the patch, but without an extra FREEZE, right? If we do both, the
> test would get more stable, wouldn't it?
Yes, I tested the patch as-is, without FREEZE, but it looks like it doesn't
matter in that case. And sorry for misleading information about missing
VACUUM records in my previous message, please ignore it.
10.01.2024 12:46, Bertrand Drouvot wrote:
>> although I've managed to
>> see another interesting failure (twice):
>> 13 # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
>> 13 # at t/035_standby_logical_decoding.pl line 227.
>>
> Looking at the attached log files and particularly 1/regress_log_035_standby_logical_decoding:
>
> "
> [11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with vacuum on pg_class
> [11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with vacuum on pg_class
> "
>
> That seems weird, the inactive slot has been invalidated while the active one is not.
> While it takes a bit longer to invalidate an active slot, I don't think the test can
> move on until both are invalidated (then leading to the tests 24 and 25)). I can
> see the tests are very slow to run (13.993s for 24) but still don't get how 24 could
> succeed while 25 does not.
>
> ...
>
> Would it be possible to also send the standby logs?
Yes, please look at the attached logs. This time I've build postgres with
-DWAL_DEBUG and ran tests with TEMP_CONFIG as below:
wal_keep_size=1GB
wal_debug = on
log_autovacuum_min_duration = 0
log_statement = 'all'
log_min_messages = INFO
The archive attached contains logs from four runs:
recovery-1-ok -- an example of successful run for reference
recovery-7-pruning and recovery-19-pruning -- failures with a failed
subtest 'activeslot slot invalidation is logged with on-access pruning'
recovery-15-vacuum_pg_class -- a failure with a failed
subtest 'activeslot slot invalidation is logged with vacuum on pg_class'
The distinction that I see in the failed run logs, for example in
recovery-15-vacuum_pg_class, 035_standby_logical_decoding_standby.log:
2024-01-10 11:00:18.700 UTC [789618] LOG: REDO @ 0/4020220; LSN 0/4020250: prev 0/401FDE0; xid 753; len 20 -
Transaction/COMMIT: 2024-01-10 11:00:18.471694+00
2024-01-10 11:00:18.797 UTC [789618] LOG: REDO @ 0/4020250; LSN 0/4020288: prev 0/4020220; xid 0; len 24 -
Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid 753 oldestRunningXid 754
2024-01-10 11:00:19.013 UTC [789618] LOG: REDO @ 0/4020288; LSN 0/40202C8: prev 0/4020250; xid 0; len 9; blkref #0: rel
1663/16384/2610, blk 0 - Heap2/PRUNE: snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, isCatalogRel: T, nunused:
0, redirected: [], dead: [48], unused: []
2024-01-10 11:00:19.111 UTC [789618] LOG: invalidating obsolete replication slot "row_removal_inactiveslot"
2024-01-10 11:00:19.111 UTC [789618] DETAIL: The slot conflicted with xid horizon 752.
2024-01-10 11:00:19.111 UTC [789618] CONTEXT: WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752,
nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel
1663/16384/2610, blk 0
2024-01-10 11:00:29.109 UTC [789618] LOG: terminating process 790377 to release replication slot "row_removal_activeslot"
2024-01-10 11:00:29.109 UTC [789618] DETAIL: The slot conflicted with xid horizon 752.
2024-01-10 11:00:29.109 UTC [789618] CONTEXT: WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752,
nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel
1663/16384/2610, blk 0
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl ERROR: canceling statement due to conflict with
recovery
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl DETAIL: User was using a logical replication slot
that must be invalidated.
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl LOG: released logical replication slot
"row_removal_activeslot"
is an absent message 'obsolete replication slot "row_removal_activeslot"'
and an additional record 'Standby/RUNNING_XACTS', which can be found in
035_standby_logical_decoding_primary.log:
2024-01-10 11:00:18.515 UTC [783410] LOG: xlog bg flush request write 0/4020250; flush: 0/4020250, current is write
0/4020220; flush 0/4020220
2024-01-10 11:00:18.646 UTC [783387] LOG: INSERT @ 0/4020288: - Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid
753 oldestRunningXid 754
2024-01-10 11:00:18.702 UTC [790526] 035_standby_logical_decoding.pl LOG: statement: SELECT (select
txid_snapshot_xmin(txid_current_snapshot()) - 753) > 0
2024-01-10 11:00:18.724 UTC [783410] LOG: xlog bg flush request write 0/4020288; flush: 0/4020288, current is write
0/4020250; flush 0/4020250
So perhaps it can affect an active slot invalidation?
Best regards,
Alexander
Attachment | Content-Type | Size |
---|---|---|
035-failures.tar.gz | application/gzip | 811.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Bertrand Drouvot | 2024-01-10 14:12:28 | Re: System username in pg_stat_activity |
Previous Message | Magnus Hagander | 2024-01-10 13:59:42 | Re: System username in pg_stat_activity |