From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) |
Date: | 2022-10-17 01:34:41 |
Message-ID: | CAD21AoDq+-V+Z3BSkQYuZQkhSmpgaJAzJdOwAZNcyAUnAWq+Lw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > Summarizing this issue, the assertion check in AssertTXNLsnOrder()
> > fails as reported because the current logical decoding cannot properly
> > handle the case where the decoding restarts from NEW_CID. Since we
> > don't make the association between top-level transaction and its
> > subtransaction while decoding NEW_CID (ie, in
> > SnapBuildProcessNewCid()), two transactions are created in
> > ReorderBuffer as top-txn and have the same LSN. This failure happens
> > on all supported versions.
> >
> > To fix the problem, one idea is that we make the association between
> > top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
> > as Tomas proposed. On the other hand, since we don't guarantee to make
> > the association between the top-level transaction and its
> > sub-transactions until we try to decode the actual contents of the
> > transaction, it makes sense to me that instead of trying to solve by
> > making association, we need to change the code which are assuming that
> > it is associated.
> >
> > I've attached the patch for this idea. With the patch, we skip the
> > assertion checks in AssertTXNLsnOrder() until we reach the LSN at
> > which we start decoding the contents of transaction, ie.
> > start_decoding_at in SnapBuild. The minor concern is other way that
> > the assertion check could miss some faulty cases where two unrelated
> > top-transactions could have same LSN. With this patch, it will pass
> > for such a case. Therefore, for transactions that we skipped checking,
> > we do the check when we reach the LSN.
> >
>
> >
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -113,6 +113,15 @@
> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> XLogReaderState *recor
> buf.origptr);
> }
>
> +#ifdef USE_ASSERT_CHECKING
> + /*
> + * Check the order of transaction LSNs when we reached the start decoding
> + * LSN. See the comments in AssertTXNLsnOrder() for details.
> + */
> + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> + AssertTXNLsnOrder(ctx->reorder);
> +#endif
> +
> rmgr = GetRmgr(XLogRecGetRmid(record));
> >
>
> I am not able to think how/when this check will be useful. Because we
> skipped assert checking only for records that are prior to
> start_decoding_at point, I think for those records ordering should
> have been checked before the restart. start_decoding_at point will be
> either (a) confirmed_flush location, or (b) lsn sent by client, and
> any record prior to that must have been processed before restart.
Good point. I was considering the case where the client sets far ahead
LSN but it's not worth considering this case in this context. I've
updated the patch accoringly.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch | application/octet-stream | 6.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | osumi.takamichi@fujitsu.com | 2022-10-17 01:45:31 | RE: [Proposal] Add foreign-server health checks infrastructure |
Previous Message | Kyotaro Horiguchi | 2022-10-17 01:16:25 | Re: Improve description of XLOG_RUNNING_XACTS |