Re: BUG #17284: Assert failed in SerialAdd() when the summarize_serial mode is engaged

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17284: Assert failed in SerialAdd() when the summarize_serial mode is engaged
Date: 2021-12-30 19:00:00
Message-ID: b4f83df5-215b-7e89-ced1-ebc571de07e4@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hello hackers,
13.11.2021 15:00, PG Bug reporting form wrote:
> When running concurrent installchecks (x100) for src/test/isolation, I've
> observed the following server crash:
> Core was generated by `postgres: postgres regress30 127.0.0.1(55840) SEL'.
> Program terminated with signal SIGABRT, Aborted.
> #0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> 50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1 0x00007fd1b69d5535 in __GI_abort () at abort.c:79
> #2 0x000055d441cc3a02 in ExceptionalCondition (
> conditionName=conditionName(at)entry=0x55d441e34cd2
> "TransactionIdIsValid(tailXid)",
> errorType=errorType(at)entry=0x55d441d1c01d "FailedAssertion",
> fileName=fileName(at)entry=0x55d441e34bc9 "predicate.c",
> lineNumber=lineNumber(at)entry=928) at assert.c:67
> #3 0x000055d441b8e0aa in SerialAdd
> (minConflictCommitSeqNo=18446744073709551615, xid=95552353) at
> predicate.c:928

The issue is reproduced (on REL_14_STABLE) with the following command
sequence (can be replayed with pgreplay):

2021-12-29 18:00:15.000 MSK|postgres|regression|61cc77ff.18af00|LOG: 
statement:
     create table D1 (id int not null);
     create table D2 (id int not null);
     insert into D1 values (1);
     insert into D2 values (1);
2021-12-29 18:00:15.100 MSK|postgres|regression|61cc77ff.18af00|LOG: 
disconnection: session time: 0:00:01.129 user=postgres
database=regression host=127.0.0.1 port=46730
2021-12-29 18:00:15.200 MSK|postgres|regression|61cc77ff.18af01|LOG: 
statement: begin isolation level serializable;
2021-12-29 18:00:15.300 MSK|postgres|regression|61cc77ff.18af02|LOG: 
statement: begin isolation level serializable;
2021-12-29 18:00:15.400 MSK|postgres|regression|61cc77ff.18af02|LOG: 
statement: update D2 set id = id + 1;
2021-12-29 18:00:15.500 MSK|postgres|regression|61cc77ff.18af02|LOG: 
statement: commit;
2021-12-29 18:00:15.600 MSK|postgres|regression|61cc77ff.18af03|LOG: 
statement: begin isolation level serializable;
2021-12-29 18:00:15.700 MSK|postgres|regression|61cc77ff.18af01|LOG: 
statement: update D1 set id = id + 1;
2021-12-29 18:00:15.800 MSK|postgres|regression|61cc77ff.18af03|LOG: 
statement: select id from D1;
2021-12-29 18:00:15.800 MSK|postgres|regression|61cc77ff.18af01|LOG: 
disconnection: session time: 0:00:01.000 user=postgres
database=regression host=127.0.0.1 port=46731
2021-12-29 18:00:15.900 MSK|postgres|regression|61cc77ff.18af02|LOG: 
statement: select pg_sleep(1);

To ease a reproduction, the following delays can be added to predicate.c:
@@ -1502,6 +1507,7 @@ SummarizeOldestCommittedSxact(void)
 {
        SERIALIZABLEXACT *sxact;
 
+       pg_usleep(1000000);
        LWLockAcquire(SerializableFinishedListLock, LW_EXCLUSIVE);
 
        /*

@@ -3745,6 +3760,7 @@ ClearOldPredicateLocks(void)
        SERIALIZABLEXACT *finishedSxact;
        PREDICATELOCK *predlock;
 
+       pg_usleep(2000000);
        /*
         * Loop through finished transactions. They are in commit order,
so we can
         * stop as soon as we find one that's still interesting.

With these delays and CPPFLAGS="-DTEST_SUMMARIZE_SERIAL" the assertion
failure can be observed just on `make check`.
As I can see, on disconnect of session 18af01 the
ReleasePredicateLocks() function calls SetNewSxactGlobalXmin() where
SerialSetActiveSerXmin(called with PredXact->SxactGlobalXmin ==
InvalidTransactionId) sets serialControl->tailXid to InvalidTransactionId.
Then session 18af03 calls SummarizeOldestCommittedSxact(), that in case
!SHMQueueEmpty(FinishedSerializableTransactions) calls SerialAdd() where
the check TransactionIdIsValid(serialControl->tailXid) fails.
But if ReleasePredicateLocks() in session 18af01 happens to reach
ReleasePredicateLocks(), the FinishedSerializableTransactions queue gets
empty, and it would prevent calling SerialAdd() in session 18af03.
So there is a small window (that probably can be widened by
LWLockAcquire()) between
            SetNewSxactGlobalXmin();
            needToClear = true;
and
    if (needToClear)
        ClearOldPredicateLocks();
where a concurrent transaction can fail in SummarizeOldestCommittedSxact().

Best regards,
Alexander

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Etsuro Fujita 2021-12-31 07:36:42 Re: BUG #17344: Assert failed on queiring async_capable foreign table with inheritance
Previous Message Reid Thompson 2021-12-30 15:59:01 Re: BUG #17348: Can never install pgsql on Ubuntu