From: | bowenshi <zxwsbg(at)qq(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | pgsql-bugs(at)lists(dot)postgresql(dot)org <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: BUG #17695: Failed Assert in logical replication snapbuild. |
Date: | 2022-11-25 07:44:03 |
Message-ID: | tencent_7EB71DA5D7BA00EB0B429DCE45D0452B6406@qq.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
Thanks for your advice! I update my test script, now it can 100% reproduce problem in my computer.
The new script and fix code are both in the patch. However, I met new problem after adding fix code. It fails into a new trap following:
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File: "snapbuild.c", Line: 593, PID: 1576)
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(ExceptionalCondition+0xb9)[0xb1ca28]
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(SnapBuildInitialSnapshot+0x1b3)[0x8f3c79]
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x9136b9]
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(exec_replication_command+0x398)[0x91456a]
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(PostgresMain+0x7be)[0x984e4a]
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8c0d5b]
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8c06cd]
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8bc9de]
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(PostmasterMain+0x117a)[0x8bc2b5]
postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x7bdb13]
The problem is not very hard to reproduce, and I’m fixing it now.
Regards,
--
bowenshi
---- Replied Message ----
From
Masahiko Sawada<sawada(dot)mshk(at)gmail(dot)com>
Date
11/25/2022 13:58
To
<zxwsbg(at)qq(dot)com> ,
<pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject
Re: BUG #17695: Failed Assert in logical replication snapbuild.
On Thu, Nov 24, 2022 at 7:28 PM PG Bug reporting form
<noreply(at)postgresql(dot)org> wrote:
The following bug has been logged on the website:
Bug reference: 17695
Logged by: 施博文
Email address: zxwsbg(at)qq(dot)com
PostgreSQL version: 14.6
Operating system: centos
Description:
In PG14 or higher version, I notice that SnapBuildRestore don't set
builder->next_phase_at=InvalidTransactionId .
But in SnapBuildSerialize function, the assert check this condition.
Assert(builder->next_phase_at == InvalidTransactionId);
This would cause some problems, and I have repeat it with the perl test case
which I would update later, the problem is following:
TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId",
File: "snapbuild.c", Line: 1604, PID: 29974)
postgres: master: walsender postgres [local]
START_REPLICATION(ExceptionalCondition+0xb9)[0xb1c9bd]
postgres: master: walsender postgres [local] START_REPLICATION[0x8f548d]
postgres: master: walsender postgres [local]
START_REPLICATION(SnapBuildProcessRunningXacts+0x55)[0x8f4c5c]
postgres: master: walsender postgres [local] START_REPLICATION[0x8dd8be]
postgres: master: walsender postgres [local]
START_REPLICATION(LogicalDecodingProcessRecord+0xd1)[0x8dd243]
postgres: master: walsender postgres [local] START_REPLICATION[0x915eb1]
postgres: master: walsender postgres [local] START_REPLICATION[0x91520c]
postgres: master: walsender postgres [local] START_REPLICATION[0x913bdd]
postgres: master: walsender postgres [local]
START_REPLICATION(exec_replication_command+0x42c)[0x914593]
postgres: master: walsender postgres [local]
START_REPLICATION(PostgresMain+0x7be)[0x984ddf]
postgres: master: walsender postgres [local] START_REPLICATION[0x8c0d41]
postgres: master: walsender postgres [local] START_REPLICATION[0x8c06b3]
postgres: master: walsender postgres [local] START_REPLICATION[0x8bc9c4]
postgres: master: walsender postgres [local]
START_REPLICATION(PostmasterMain+0x117a)[0x8bc29b]
postgres: master: walsender postgres [local] START_REPLICATION[0x7bdaf9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f815dd92555]
postgres: master: walsender postgres [local] START_REPLICATION[0x488d09]
Thank you for reporting the issue!
I could reproduce this issue with a small change and the following steps:
1. Add sleep after setting SNAPBUILD_BUILDING_SNAPSHOT.
@@ -1409,6 +1409,9 @@ SnapBuildFindSnapshot(SnapBuild *builder,
XLogRecPtr lsn, xl_running_xacts *runn
running->xcnt, running->nextXid)));
SnapBuildWaitSnapshot(running, running->nextXid);
+
+ elog(LOG, "sleep 10s");
+ pg_usleep(10 * 1000000);
}
2. Create and start database instance and create a table and the
replication slot 's1'.
create table test (i int);
select pg_logical_replication_slot('s1', 'test_decoding');
3. Create another replication slot "s2" while a transaction is running.
tx-1:
begin;
insert into test values (1);
tx-2:
select pg_logical_replication_slot('s2', 'test_decoding');
checkpoint;
commit;
Note that replication slot creation will wait for 10 sec after
commiting the tx-1, please go to the next step once the slot has been
created.
4. Start decoding on slot 's2'.
select pg_logical_slot_get_changes('s2', null, null);
The logical decoding will sleep. Please go to the next step after
seeing the log "sleep 10s".
5. While the logical decoding started at step 4 is sleeping, start
decoding on slot 's1'.
select pg_logical_slot_get_changes('s1', null, null);
6. After the decoding on slot 's1' wakes up, it fails due to the
assertion check.
This scenario simulates the case where the logical decoding (re)starts
where RUNNING_XACTS record having a running transactions and it
restores the serialized snapshot when decoding the next RUNNING_XACTS
record. Since we don't reset builder->next_phase_at when restoring a
serialized snapshot, the assertion check in SnapBuildSerialize fails.
Regarding the proposed patch, I've confirmed it fixes this issue. But
I think it's better to reset builder->next_phase_at right after the
following assertion check:
/* consistent snapshots have no next phase */
Assert(ondisk.builder.next_phase_at == InvalidTransactionId);
I could not reproduce this issue with your script in my environment. I
think it's better to include the reproducible test case in the patch
but I'm not sure how to do that without adding sleep/gdb attach.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment | Content-Type | Size |
---|---|---|
unknown_filename | text/html | 10.1 KB |
0001-Fix-snapshot-builder-s-state-after-restoring-snapsho.patch | application/octet-stream | 4.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Hayato Kuroda (Fujitsu) | 2022-11-25 09:07:56 | RE: BUG #17695: Failed Assert in logical replication snapbuild. |
Previous Message | Masahiko Sawada | 2022-11-25 05:58:40 | Re: BUG #17695: Failed Assert in logical replication snapbuild. |