Re: BUG #18385: Assert("strategy_delta >= 0") in BgBufferSync() fails due to race condition

From: Tender Wang <tndrwang(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18385: Assert("strategy_delta >= 0") in BgBufferSync() fails due to race condition
Date: 2024-03-15 08:05:45
Message-ID: CAHewXNk0sTA-OKc5ezxta16=56ODh2808Vqtu8q-9Zjpdy6DGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Alexander Lakhin <exclusion(at)gmail(dot)com> 于2024年3月13日周三 16:00写道:

> Hi Tender Wang,
>
> 13.03.2024 09:50, Tender Wang wrote:
> > Hi Alexander,
> > I haven't been able to reproduce this issue on my machine(2 vCPU, 2GB
> memory).
> >
> > Can you reproduce this issue reliably on your machine?
>
> Thanks for your attention to this issue!
>
> On my 12-core workstation, where pgbench shows approximately 2500 tps,
> `pgbench -t 10000 -c 40` failed on iterations 1, 1, 3:
> ...
> number of transactions actually processed: 398639/400000
> number of failed transactions: 0 (0.000%)
> latency average = 15.931 ms
> initial connection time = 56.806 ms
> tps = 2510.845701 (without initial connection time)
> pgbench: error: Run was aborted; the above results are incomplete.
> ...
>
> The server built with gcc,
> CPPFLAGS="-O0" ./configure --enable-debug --enable-cassert ...
>
> I have also intensified bgwriter as follows:
> rc = WaitLatch(MyLatch,
> WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> - BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
> + 1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
>
> That is, I have the attached modification applied and the following in my
> extra.config:
> fsync = off
> autovacuum = off
> shared_buffers = '1MB'
>

Using your provided patch and configuration, I can reproduce this issue
easily on my mashine(2 vCPU, 2GB).
I went throug the StrategySyncStart() and ClockSweepTick() code. After
d72731a70 commit, operation on StrategyControl->nextVictimBuffer
doesn't need to get the StrategyControl->buffer_strategy_lock, but
StrategyControl->completePasses still needs lock.

Before d72731a70, if bgwriter gets the
StrategyControl->buffer_strategy_lock spinlock in StrategySyncStart(),
backends can't add StrategyControl->nextVictimBuffer,
then bgwriter can get a consistent value consisting of nextVictimBuffer
and completePasses.

But now, even though bgwriter gets the spinlock, backends also could add
the StrategyControl->nextVictimBuffer. In corner cases, bgwriter will see a
litter StrategyControl->nextVictimBuffer
value due to StrategyControl->nextVictimBuffer wraparound. But
StrategyControl->completePasses didn't update. So the
Assert(strategy_delta >= 0) will trigger.

I have two solutions in my head:
1. remove the Assert, but I'm not srue bgwriter write strategy should
change if strategy_delta < 0.
2. add more check in BgBufferSync() just like below:
/*
* Since nextVictimBuffer in StrategyControl has been atomic.
* So its operation would not need to get buffer_strategy_lock.
* In extreme circumstances, StrategySyncStart would see not consistent
* value consisting of nextVictimBuffer and completePasses.
* So we add one to passes_delta to make strategy_delta >= 0.
*/
if (passes_delta == 0 && strategy_delta < 0)
passes_delta++;

I test two times using the second solution, not trigger Assert again. Need
more times to test.

--
Tender Wang
OpenPie: https://en.openpie.com/

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tender Wang 2024-03-15 08:48:34 Re: BUG #18385: Assert("strategy_delta >= 0") in BgBufferSync() fails due to race condition
Previous Message PG Bug reporting form 2024-03-15 07:40:21 BUG #18395: Checksum verification failed for: deb_postgis_3_4_pg16.app.zip