Re: BUG #18658: Assert in SerialAdd() due to race condition

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andrew Bille <andrewbille(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18658: Assert in SerialAdd() due to race condition
Date: 2024-10-19 09:00:00
Message-ID: ea48b857-4e07-dd43-375e-564e13f5bfb2@gmail.com
Lists: pgsql-bugs

Hello Heikki,

18.10.2024 23:15, Heikki Linnakangas wrote:
>
> Thanks for the repro, Andrew & Alexander! I was able to reproduce this too. It reproduces very quickly with the script
> you provided, if you add this sleep to ReleasePredicateLocks():
>
> @@ -3654,6 +3667,8 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
>
>      LWLockRelease(SerializableFinishedListLock);
>
> +    pg_usleep(1000);
> +
>      if (needToClear)
>          ClearOldPredicateLocks();
>
> I think the assertion is too strict. It is normal for tailXid to be invalid in this scenario. The condition is that an
> XID was added to the finished list, but the global xmin has already advanced past that XID. It gets cleared from the
> finished list by the ClearOldPredicateLocks() call, but another backend might call SummarizeOldestCommittedSxact()
> before that.
>
> The attached patch fixes it.
>

Thank you for your attention to this!

I also encountered another (rarer) failure with that script (initially
on REL_16_STABLE, but I've now reproduced it on master too), where it
fails due to ENOSPC. (I could reproduce the failure more or less reliably
by running the script in parallel, -j4, against 4 different servers.)

With additional logging added (see attached), I see the following:
2024-10-19 07:34:48.254 UTC [3032898:1][client backend][48/278:0] LOG:  !!!SerialAdd| xid: 19957,
serialControl->headPage: 4294967295, tailXid: 20491, SERIAL_ENTRIESPERPAGE: 1024, firstZeroPage: 20, targetPage: 19,
isNewPage: 1
2024-10-19 07:34:48.254 UTC [3032898:2][client backend][48/278:0] STATEMENT:  INSERT INTO t VALUES(42);
2024-10-19 07:34:48.254 UTC [3032898:3][client backend][48/278:0] LOG:  !!!SerialAdd: isNewPage, firstZeroPage: 20,
targetPage: 19
2024-10-19 07:34:48.254 UTC [3032898:4][client backend][48/278:0] STATEMENT:  INSERT INTO t VALUES(42);
2024-10-19 07:35:05.105 UTC [3032898:5][client backend][48/278:0] ERROR:  could not access status of transaction 0
2024-10-19 07:35:05.105 UTC [3032898:6][client backend][48/278:0] DETAIL:  Could not write to file "pg_serial/11FB3" at
offset 8192: No space left on device.

That is, if SerialAdd() gets an xid that precedes tailXid and belongs to
an earlier page (here xid 19957 is on page 19, while tailXid 20491 is on
page 20), the page-zeroing loop just keeps running until ENOSPC.
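
To illustrate the arithmetic, here is a minimal standalone model of that
situation. It is a sketch and not the actual predicate.c code: it assumes
the zeroing loop advances one page at a time, with wraparound, until it
reaches the target page. The names page_of(), next_page() and
pages_zeroed_before_target() and the max_page parameter are hypothetical;
only the value 1024 is taken from the log above.

```c
#include <stdint.h>

/* Entries per pg_serial page, as reported in the log (SERIAL_ENTRIESPERPAGE). */
#define ENTRIES_PER_PAGE 1024

/* Hypothetical helper: page that holds the entry for a given xid. */
static int64_t
page_of(uint32_t xid)
{
    return xid / ENTRIES_PER_PAGE;
}

/*
 * Hypothetical helper: next page, wrapping to 0 after max_page.
 * max_page is an assumed parameter, not a value from the report.
 */
static int64_t
next_page(int64_t page, int64_t max_page)
{
    return (page >= max_page) ? 0 : page + 1;
}

/*
 * Count how many pages a "while (first != target)" loop visits
 * before it reaches target_page.
 */
static int64_t
pages_zeroed_before_target(int64_t first_zero_page, int64_t target_page,
                           int64_t max_page)
{
    int64_t count = 0;

    while (first_zero_page != target_page)
    {
        count++;                /* stands in for zeroing one page */
        first_zero_page = next_page(first_zero_page, max_page);
    }
    return count;
}
```

With the values from the log, page_of(19957) is 19 and page_of(20491) is
20, so the loop starts one page past its target. In this model it can only
reach page 19 by going all the way around the page space, which matches
the zeroing continuing until the disk fills up.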

Your proposed fix (adjusted for REL_16_STABLE) eliminates the issue for me.
Thank you!

Best regards,
Alexander

Attachment Content-Type Size
SerialAdd-debugging.patch text/x-patch 1.6 KB
