Re: AIO v2.5

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: AIO v2.5
Date: 2025-04-24 04:04:04
Message-ID: a26cad59-ec3e-42ee-956a-3826f051d1f8@gmail.com

Hello Andres,

24.04.2025 03:40, Andres Freund wrote:
> Hi,
>
> On 2025-04-20 18:00:00 +0300, Alexander Lakhin wrote:
>
>> 2025-04-08 01:41:54.670 UTC [4013251][client backend][0/2:0] DEBUG: waiting for self with 0 pending
> I'd like to extend this debug message with the number of in-flight IOs. I
> assume nobody will mind me doing that.

I printed that number while debugging and always got 1, since
io_max_concurrency == 1 during that test (a consequence of shared_buffers=1MB).
With shared_buffers increased:
--- a/src/test/recovery/t/032_relfilenode_reuse.pl
+++ b/src/test/recovery/t/032_relfilenode_reuse.pl
@@ -18,7 +18,7 @@ log_connections=on
 # to avoid "repairing" corruption
 full_page_writes=off
 log_min_messages=debug2
-shared_buffers=1MB
+shared_buffers=16MB

With that change, 100 iterations of the test passed.
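
In case it is useful, the instrumentation I used was essentially the following
(a sketch only; I am assuming the per-backend in-flight list is
pgaio_my_backend->in_flight_ios, the dclist that pgaio_closing_fd() iterates):

    /* sketch: pgaio_io_wait_for_free() with the in-flight count added to
     * the existing debug message; in_flight_ios is my assumption for the
     * per-backend dclist of submitted-but-unfinished IOs */
    pgaio_debug(DEBUG2,
                "waiting for self with %d pending, %u in-flight",
                pgaio_my_backend->num_staged_ios,
                dclist_count(&pgaio_my_backend->in_flight_ios));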

>> I could reproduce this error with the following TEMP_CONFIG:
>> log_min_messages = DEBUG4
>>
>> log_connections = on
>> log_disconnections = on
>> log_statement = 'all'
>> log_line_prefix = '%m [%d][%p:%l][%b] %q[%a] '
>>
>> fsync = on
>> io_method=io_uring
>> backtrace_functions = 'pgaio_io_wait_for_free'
> Hm. Several hundred iterations in without triggering the issue even once, with
> that added config. Given your reproducer used fsync, I assume this was on a
> real filesystem not tmpfs? What FS? I tried on XFS.
>
> It very well could just be a timing-dependent issue; the time for an fsync will
> differ heavily between devices...

Yeah, it definitely depends on timing.
Using that TEMP_CONFIG changes the duration of the test significantly for me:
Files=1, Tests=14,  2 wallclock secs ( 0.01 usr  0.01 sys +  0.13 cusr  0.32 csys =  0.47 CPU)
->
Files=1, Tests=14, 14 wallclock secs ( 0.00 usr  0.00 sys +  0.14 cusr  0.33 csys =  0.47 CPU)

I reproduced the failure on ext4 (a desktop-grade NVMe drive) and on xfs (a
cheap SSD), on two different machines.

It also reproduces on an HDD (with ext4), seemingly even more readily; there
the test duration increases further:
I 25
# +++ tap check in src/test/recovery +++
t/032_relfilenode_reuse.pl .. ok
All tests successful.
Files=1, Tests=14, 28 wallclock secs ( 0.00 usr  0.00 sys +  0.15 cusr  0.39 csys =  0.54 CPU)
Result: PASS
I 26
# +++ tap check in src/test/recovery +++
t/032_relfilenode_reuse.pl .. 7/? Bailout called.  Further testing stopped:  aborting wait: program timed out

>> When running 032_relfilenode_reuse.pl in a loop, I got failures on
>> iterations 10, 18, 6:
>> 2025-04-20 15:56:01.310 CEST [][1517068:122][startup] DEBUG: updated min recovery point to 0/4296E90 on timeline 1
>> 2025-04-20 15:56:01.310 CEST [][1517068:123][startup] CONTEXT:  WAL redo at 0/4296E48 for Transaction/COMMIT: 2025-04-20 15:56:01.09363+02; inval msgs: catcache 21; sync
>> 2025-04-20 15:56:01.310 CEST [][1517068:124][startup] DEBUG: waiting for all backends to process ProcSignalBarrier generation 5
>> 2025-04-20 15:56:01.310 CEST [][1517068:125][startup] CONTEXT:  WAL redo at 0/4296E90 for Database/DROP: dir 16409/50001
>> 2025-04-20 15:56:01.312 CEST [postgres][1517075:144][client backend] [032_relfilenode_reuse.pl] DEBUG:  waiting for self with 0 pending
>> 2025-04-20 15:56:01.312 CEST [postgres][1517075:145][client backend] [032_relfilenode_reuse.pl] BACKTRACE:
>> pgaio_io_wait_for_free at aio.c:703:2
>>  (inlined by) pgaio_io_acquire at aio.c:186:3
>> AsyncReadBuffers at bufmgr.c:1854:9
>> StartReadBuffersImpl at bufmgr.c:1425:18
>>  (inlined by) StartReadBuffers at bufmgr.c:1497:9
>> read_stream_start_pending_read at read_stream.c:333:25
>> read_stream_look_ahead at read_stream.c:475:3
>> read_stream_next_buffer at read_stream.c:840:6
>> pg_prewarm at pg_prewarm.c:214:10
>> ExecInterpExpr at execExprInterp.c:927:7
>> ExecEvalExprNoReturn at executor.h:447:2
>> ...
>> 2025-04-20 15:56:01.312 CEST [][1517068:126][startup] DEBUG: finished waiting for all backends to process ProcSignalBarrier generation 5
>> 2025-04-20 15:56:01.312 CEST [][1517068:127][startup] CONTEXT:  WAL redo at 0/4296E90 for Database/DROP: dir 16409/50001
>> 2025-04-20 15:56:01.314 CEST [postgres][1517075:146][client backend] [032_relfilenode_reuse.pl] ERROR:  no free IOs despite no in-flight IOs
> The earlier report also had a ProcSignalBarrier wait just before. That
> doesn't have to be related; the test stresses those pretty intentionally.
>
> But it does seem like it could be related; perhaps we somehow end up processing
> interrupts while issuing IOs, which then submits IOs again, or something? I
> don't immediately see such code, e.g. pgaio_closing_fd() is careful to iterate
> over the list while waiting.
>
> I think pgaio_closing_fd() should get a DEBUG2 (or so) message documenting
> that we are waiting for an IO due to closing an FD. pgaio_shutdown() as well,
> while at it.
>
>
> I somewhat suspect that this is the same issue that you reported earlier, that
> I also couldn't repro. That involved a lot of CREATE DATABASEs, which will
> trigger a lot of procsignal barriers... I'll try to come up with a more
> dedicated stress test and see whether I can trigger a problem that way.

Maybe it does matter that "no free IOs despite no in-flight IOs" always follows
"WAL redo at 0/4296E90 for Database/DROP" here (that pairing seems constant in my runs)...

I also tried the following:
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -705,6 +705,7 @@ pgaio_io_wait_for_free(void)
     pgaio_debug(DEBUG2, "waiting for self with %d pending",
                 pgaio_my_backend->num_staged_ios);
 
+    pg_sleep(3000);

and the test then failed on every run with:
t/032_relfilenode_reuse.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)

032_relfilenode_reuse_primary.log
2025-04-24 05:29:27.103 CEST [3504475] 032_relfilenode_reuse.pl LOG:  statement: UPDATE large SET datab = 1;
2025-04-24 05:29:27.104 CEST [3504475] 032_relfilenode_reuse.pl DEBUG:  waiting for self with 0 pending
2025-04-24 05:29:27.112 CEST [3504369] DEBUG:  releasing pm child slot 3
2025-04-24 05:29:27.112 CEST [3504369] LOG:  client backend (PID 3504475) was terminated by signal 11: Segmentation fault
2025-04-24 05:29:27.112 CEST [3504369] DETAIL:  Failed process was running: UPDATE large SET datab = 1;
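
(A side note on my own hack above: pg_sleep() is the SQL-callable function,
which takes a FunctionCallInfo rather than a number of seconds, so calling it
directly from C like that is dubious and might even explain the segfault; I
haven't retried yet, but a C-level delay would look something like the snippet
below, where the 3-second value is just my choice.)

    /* sketch: inject the delay with the C-level primitive instead;
     * pg_usleep() takes microseconds */
    pg_usleep(3000000L);        /* ~3 seconds */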

>> I also encountered another failure when running this test:
>> t/032_relfilenode_reuse.pl .. 1/? Bailout called.  Further testing stopped:  aborting wait: program died
>>
>> Core was generated by `postgres: standby: law postgres [local] SELECT                   '.
>> Program terminated with signal SIGABRT, Aborted.
>> #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
>>
>> (gdb) bt
>> #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
>> #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
>> #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
>> #3  0x0000700a5a24527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
>> #4  0x0000700a5a2288ff in __GI_abort () at ./stdlib/abort.c:79
>> #5  0x0000630ff2c5dcea in errfinish (filename=filename@entry=0x630ff2cdd139 "aio.c", lineno=lineno@entry=559,
>>     funcname=funcname@entry=0x630ff2e85da8 <__func__.9> "pgaio_io_wait") at elog.c:599
>> #6  0x0000630ff2aae79c in pgaio_io_wait (ioh=ioh@entry=0x700a57b59700, ref_generation=3972) at aio.c:559
>> #7  0x0000630ff2aaefab in pgaio_io_wait_for_free () at aio.c:771
> Do you still have the core file for this? Or at least the generated log
> message? It'd be useful to know what state the IO was in when the error was
> triggered...

I can reproduce it easily, so here is the message:
2025-04-24 06:10:01.924 EEST [postgres][3036378:150][client backend] [032_relfilenode_reuse.pl] PANIC:  waiting for own IO in wrong state: 0
2025-04-24 06:10:01.924 EEST [postgres][3036378:151][client backend] [032_relfilenode_reuse.pl] STATEMENT:  SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

(Sorry for not including it before; I had observed it many times, always with
state 0, so it no longer seemed noteworthy to me.)

Best regards,
Alexander Lakhin
Neon (https://neon.tech)
