Re: BUG #18658: Assert in SerialAdd() due to race condition

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andrew Bille <andrewbille(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18658: Assert in SerialAdd() due to race condition
Date: 2024-10-18 08:00:00
Message-ID: 5634e3e1-6e82-b883-ac87-52d7e926a04d@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hello Andrew,

17.10.2024 08:47, Andrew Bille wrote:
> Using a build with no asserts enabled and without that define, I can also see the following errors using the script
> with increased concurrency (Initially we got a report from our client, who encountered these errors in production):
>
> 2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|PANIC:  could not access status of transaction 18204
> 2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|DETAIL:  Could not open file "pg_serial/0000": No such file
> or directory.
>
> I changed ERROR to PANIC to get a core dump.
>

With debug logging added (look at the patch attached), I could observe
events, leading to the error:
2024-10-18 07:11:29.301 UTC [2370369:55][client backend][96/762:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.301 UTC [2370325:58][client backend][91/0:77843] LOG:  !!!SerialSetActiveSerXmin|
serialControl->tailXid: 77832
2024-10-18 07:11:29.301 UTC [2370325:59][client backend][91/0:77843] STATEMENT:  INSERT INTO t VALUES(42);

1) SerialSetActiveSerXmin(), called by some backend, sets tailXid, which
 corresponds to page 77832 / 1024 = 76.

2024-10-18 07:11:29.301 UTC [2370252:83][client backend][85/794:0] LOG:  statement: INSERT INTO t VALUES(42);
...
2024-10-18 07:11:29.309 UTC [2370375:56][client backend][83/767:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370271:106][client backend][63/810:0] LOG:  !!!SerialAdd| xid: 77824,
serialControl->headPage: -1, tailXid: 77832, SERIAL_ENTRIESPERPAGE: 1024, firstZeroPage: 76, targetPage: 76, isNewPage: 1
2024-10-18 07:11:29.309 UTC [2370271:107][client backend][63/810:0] STATEMENT:  INSERT INTO t VALUES(42);

2.1) SerialAdd() called when headPage is not defined yet, for xid 77824,
 which corresponds to page 77824 / 1024 = 76.

2024-10-18 07:11:29.309 UTC [2370327:86][client backend][97/795:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370379:72][client backend][87/777:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370376:74][client backend][98/781:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370321:74][client backend][71/789:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370277:86][client backend][76/795:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370325:82][client backend][91/791:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370373:64][client backend][84/779:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370271:108][client backend][63/810:0] LOG:  !!!SerialAdd| after SimpleLruZeroPage(),
firstZeroPage: 76, slotno: 0
2024-10-18 07:11:29.309 UTC [2370271:109][client backend][63/810:0] STATEMENT:  INSERT INTO t VALUES(42);

2.2) SerialAdd() calls SimpleLruZeroPage() to zero out page 76.

2024-10-18 07:11:29.309 UTC [2370330:96][client backend][1/803:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370274:95][client backend][70/805:0] LOG:  !!!SerialAdd| xid: 77790,
serialControl->headPage: 76, firstZeroPage: 77, targetPage: 75, isNewPage: 0
2024-10-18 07:11:29.309 UTC [2370274:96][client backend][70/805:0] STATEMENT:  INSERT INTO t VALUES(42);

3.1) SerialAdd() called for another transaction, with xid 77790, which
 corresponds to page 77790 / 1024 = 75, i. e. the page before the page
 just zeroed; this page considered not new.

2024-10-18 07:11:29.309 UTC [2370274:97][client backend][70/805:0] LOG:  !!!SimpleLruReadPage| pageno: 75, slotno: 16,
shared->page_status[slotno]: 0, shared->page_number[slotno]: 0, shared->page_dirty[slotno]: 0
2024-10-18 07:11:29.309 UTC [2370274:98][client backend][70/805:0] STATEMENT:  INSERT INTO t VALUES(42);

3.2) SerialAdd() calls SimpleLruReadPage() for page 75, the page has status
 SLRU_PAGE_EMPTY, so SimpleLruReadPage() wants to read it from disk...

2024-10-18 07:11:29.309 UTC [2370257:89][client backend][58/802:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370355:71][client backend][74/780:0] LOG:  statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370274:99][client backend][70/805:0] PANIC:  could not access status of transaction 77790
2024-10-18 07:11:29.309 UTC [2370274:100][client backend][70/805:0] DETAIL:  Could not open file "pg_serial/0002": No
such file or directory.
2024-10-18 07:11:29.309 UTC [2370274:101][client backend][70/805:0] STATEMENT:  INSERT INTO t VALUES(42);

3.3) And fails inside SlruPhysicalReadPage(), because that page was never
 written to disk.

(I used log_line_prefix = '%m [%p:%l][%b][%v:%x] '; decreased 80 to 50 in
the script posted above, had to run the script several dozen times to
catch the error.)

I also saw a failure, when SerialAdd(), which actions 2 and 3 were
performed by one backend.

Best regards,
Alexander

Attachment Content-Type Size
SLRU-debugging.patch text/x-patch 2.8 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2024-10-18 14:48:08 BUG #18663: synchronous_standby_names vs synchronous_commit vs pg_stat_replication
Previous Message Amit Langote 2024-10-18 06:14:06 Re: BUG #18657: Using JSON_OBJECTAGG with volatile function leads to segfault