| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> | 
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org | 
| Subject: | Re: Non-reproducible AIO failure | 
| Date: | 2025-04-23 22:04:56 | 
| Message-ID: | oj3ydw3irpcrjirw5ihcfcqlav2c26x35fhfq2pk4fo5sl6zh5@c25nlzjx3b6e | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Hi,
On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
> My buildfarm animal sifaka just failed like this [1]:
There's nothing really special about sifaka, is there? I see -std=gnu99 and a
few debug -D cppflags, but they don't look they could really be relevant here.
> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
Huh.
I assume there's no core file left over? Would be helpful to query some more
state (e.g. whether operation->io_wref is valid (think it has to be) and
whether this was a partial read or such).
> 0   postgres                            0x0000000100e3df2c ExceptionalCondition + 108
> 1   postgres                            0x0000000100cbb154 WaitReadBuffers + 616
> 2   postgres                            0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
> 3   postgres                            0x0000000100cb6dd8 read_stream_next_buffer + 300
> 4   postgres                            0x00000001009b75b4 heap_fetch_next_buffer + 136
> 5   postgres                            0x00000001009ad8c4 heapgettup + 368
> 6   postgres                            0x00000001009ad364 heap_getnext + 104
> 7   postgres                            0x00000001009b9770 heapam_index_build_range_scan + 700
> 8   postgres                            0x00000001009ef4b8 spgbuild + 480
> 9   postgres                            0x0000000100a41acc index_build + 428
> 10  postgres                            0x0000000100a43d14 reindex_index + 752
> 11  postgres                            0x0000000100ad09e0 ExecReindex + 1908
> 12  postgres                            0x0000000100d04628 ProcessUtilitySlow + 1300
> 13  postgres                            0x0000000100d03754 standard_ProcessUtility + 1528
> 14  pg_stat_statements.dylib            0x000000010158d7a0 pgss_ProcessUtility + 632
> 15  postgres                            0x0000000100d02cec PortalRunUtility + 184
> 16  postgres                            0x0000000100d02354 PortalRunMulti + 236
> 17  postgres                            0x0000000100d01d94 PortalRun + 456
> 18  postgres                            0x0000000100d00e10 exec_simple_query + 1308
> ...
> 2025-04-23 16:08:28.834 EDT [78788:4] LOG:  client backend (PID 79322) was terminated by signal 6: Abort trap: 6
> 2025-04-23 16:08:28.834 EDT [78788:5] DETAIL:  Failed process was running: reindex index spgist_point_idx;
> Scraping the buildfarm database confirmed that there have been no
> other similar failures since AIO went in, and unsurprisingly some
> attempts to reproduce it manually didn't bear fruit.  So I'm
> suspecting a race condition with a very narrow window, but I don't
> have an idea where to look.  Any thoughts on how to gather more
> data?
The weird thing is that the concrete symptom here doesn't actually involve
anything that could really race - the relevant state is only updated by the
process itself.  But of course there has to be something narrow here,
otherwise we'd have seen this before :(.
Greetings,
Andres Freund
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Jacob Champion | 2025-04-23 22:11:21 | Re: What's our minimum supported Python version? | 
| Previous Message | Nathan Bossart | 2025-04-23 21:44:51 | Re: Large expressions in indexes can't be stored (non-TOASTable) |