From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Andrew Bille <andrewbille(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18658: Assert in SerialAdd() due to race condition |
Date: | 2024-10-18 08:00:00 |
Message-ID: | 5634e3e1-6e82-b883-ac87-52d7e926a04d@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
Hello Andrew,
17.10.2024 08:47, Andrew Bille wrote:
> Using a build with no asserts enabled and without that define, I can also see the following errors using the script
> with increased concurrency (Initially we got a report from our client, who encountered these errors in production):
>
> 2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|PANIC: could not access status of transaction 18204
> 2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|DETAIL: Could not open file "pg_serial/0000": No such file
> or directory.
>
> I changed ERROR to PANIC to get a core dump.
>
With debug logging added (look at the patch attached), I could observe
events, leading to the error:
2024-10-18 07:11:29.301 UTC [2370369:55][client backend][96/762:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.301 UTC [2370325:58][client backend][91/0:77843] LOG: !!!SerialSetActiveSerXmin|
serialControl->tailXid: 77832
2024-10-18 07:11:29.301 UTC [2370325:59][client backend][91/0:77843] STATEMENT: INSERT INTO t VALUES(42);
1) SerialSetActiveSerXmin(), called by some backend, sets tailXid, which
corresponds to page 77832 / 1024 = 76.
2024-10-18 07:11:29.301 UTC [2370252:83][client backend][85/794:0] LOG: statement: INSERT INTO t VALUES(42);
...
2024-10-18 07:11:29.309 UTC [2370375:56][client backend][83/767:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370271:106][client backend][63/810:0] LOG: !!!SerialAdd| xid: 77824,
serialControl->headPage: -1, tailXid: 77832, SERIAL_ENTRIESPERPAGE: 1024, firstZeroPage: 76, targetPage: 76, isNewPage: 1
2024-10-18 07:11:29.309 UTC [2370271:107][client backend][63/810:0] STATEMENT: INSERT INTO t VALUES(42);
2.1) SerialAdd() called when headPage is not defined yet, for xid 77824,
which corresponds to page 77824 / 1024 = 76.
2024-10-18 07:11:29.309 UTC [2370327:86][client backend][97/795:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370379:72][client backend][87/777:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370376:74][client backend][98/781:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370321:74][client backend][71/789:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370277:86][client backend][76/795:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370325:82][client backend][91/791:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370373:64][client backend][84/779:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370271:108][client backend][63/810:0] LOG: !!!SerialAdd| after SimpleLruZeroPage(),
firstZeroPage: 76, slotno: 0
2024-10-18 07:11:29.309 UTC [2370271:109][client backend][63/810:0] STATEMENT: INSERT INTO t VALUES(42);
2.2) SerialAdd() calls SimpleLruZeroPage() to zero out page 76.
2024-10-18 07:11:29.309 UTC [2370330:96][client backend][1/803:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370274:95][client backend][70/805:0] LOG: !!!SerialAdd| xid: 77790,
serialControl->headPage: 76, firstZeroPage: 77, targetPage: 75, isNewPage: 0
2024-10-18 07:11:29.309 UTC [2370274:96][client backend][70/805:0] STATEMENT: INSERT INTO t VALUES(42);
3.1) SerialAdd() called for another transaction, with xid 77790, which
corresponds to page 77790 / 1024 = 75, i. e. the page before the page
just zeroed; this page considered not new.
2024-10-18 07:11:29.309 UTC [2370274:97][client backend][70/805:0] LOG: !!!SimpleLruReadPage| pageno: 75, slotno: 16,
shared->page_status[slotno]: 0, shared->page_number[slotno]: 0, shared->page_dirty[slotno]: 0
2024-10-18 07:11:29.309 UTC [2370274:98][client backend][70/805:0] STATEMENT: INSERT INTO t VALUES(42);
3.2) SerialAdd() calls SimpleLruReadPage() for page 75, the page has status
SLRU_PAGE_EMPTY, so SimpleLruReadPage() wants to read it from disk...
2024-10-18 07:11:29.309 UTC [2370257:89][client backend][58/802:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370355:71][client backend][74/780:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370274:99][client backend][70/805:0] PANIC: could not access status of transaction 77790
2024-10-18 07:11:29.309 UTC [2370274:100][client backend][70/805:0] DETAIL: Could not open file "pg_serial/0002": No
such file or directory.
2024-10-18 07:11:29.309 UTC [2370274:101][client backend][70/805:0] STATEMENT: INSERT INTO t VALUES(42);
3.3) And fails inside SlruPhysicalReadPage(), because that page was never
written to disk.
(I used log_line_prefix = '%m [%p:%l][%b][%v:%x] '; decreased 80 to 50 in
the script posted above, had to run the script several dozen times to
catch the error.)
I also saw a failure, when SerialAdd(), which actions 2 and 3 were
performed by one backend.
Best regards,
Alexander
Attachment | Content-Type | Size |
---|---|---|
SLRU-debugging.patch | text/x-patch | 2.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | PG Bug reporting form | 2024-10-18 14:48:08 | BUG #18663: synchronous_standby_names vs synchronous_commit vs pg_stat_replication |
Previous Message | Amit Langote | 2024-10-18 06:14:06 | Re: BUG #18657: Using JSON_OBJECTAGG with volatile function leads to segfault |