Re: Reorderbuffer crash during recovery

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reorderbuffer crash during recovery
Date: 2019-11-06 12:10:52
Message-ID: CAFiTN-vb69GF-BMWy8G56_VDW5qJbCS2_mvNgiDRjUfzNB0Dkw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> Hi,
>
> I found couple of crashes in reorderbuffer while review/testing of
> logical_work_mem and logical streaming of large in-progress
> transactions. Stack trace of the same are given below:
> Issue 1:
> #0 0x00007f985c7d8337 in raise () from /lib64/libc.so.6
> #1 0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
> #2 0x0000000000ec514d in ExceptionalCondition
> (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> "FailedAssertion",
> fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> off=48) at ../../../../src/include/lib/ilist.h:458
> #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> oldestRunningXid=895) at reorderbuffer.c:1910
> #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> buf=0x7fff7e7b1e40) at decode.c:332
> #6 0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
> record=0x3342770) at decode.c:121
> #7 0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
> #8 0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
> <XLogSendLogical>) at walsender.c:2199
> #9 0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
> walsender.c:1128
> #10 0x0000000000b6ce83 in exec_replication_command
> (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
> (proto_version '1', publication_names '\"pub1\"')")
> at walsender.c:1545
> #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
> dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
> postgres.c:4256
> #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
> #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
> #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
> #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
> postmaster.c:1400
> #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210
>
> Issue 2:
> #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> #2 0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
> fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
> #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> buf=0x7ffcbc74cc00) at decode.c:332
> #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> record=0x2b67990) at decode.c:121
> #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
>
> From initial analysis it looks like:
> Issue1 it seems like if all the reorderbuffer has been flushed and
> then the server restarts. This problem occurs.
> Issue 2 it seems like if there are many subtransactions present and
> then the server restarts. This problem occurs. The subtransaction's
> final_lsn is not being set and when ReorderBufferRestoreCleanup is
> called the assert fails. May be for this we might have to set the
> subtransaction's final_lsn before cleanup(not sure).
>
> I could not reproduce this issue consistently with a test case, But I
> felt this looks like a problem from review.
>
> For issue1, I could reproduce by the following steps:
> 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
> 2) Have many open transactions with subtransactions open.
> 3) Attach one of the transaction from gdb and call abort().

Do you need subtransactions for the issue1? It appears that after the
restart if the changes list is empty it will hit the assert. Am I
missing something?

>
> I'm not sure of the fix for this. If I get time I will try to spend
> more time to find out the fix.
> Thoughts?
>

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Mircea Pirv 2019-11-06 13:37:31 Re: BUG #16094: Database entering recovery mode
Previous Message vignesh C 2019-11-06 11:50:02 Reorderbuffer crash during recovery

Browse pgsql-hackers by date

  From Date Subject
Next Message Leif Gunnar Erlandsen 2019-11-06 12:24:41 Re: pause recovery if pitr target not reached
Previous Message vignesh C 2019-11-06 11:50:02 Reorderbuffer crash during recovery