Re: AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at>
Subject: Re: AIO v2.5
Date: 2025-04-03 18:19:43
Message-ID: 3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2025-04-01 17:47:51 -0400, Andres Freund wrote:
> There are three different types of failures in the test_aio test so far:

And a fourth, visible after I enabled liburing support for skink.

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-04-03%2007%3A06%3A19&stg=pg_upgrade-check
(ignore the pg_upgrade and oauth failures, they're independent, I've raised
them separately)

4a)

2025-04-03 10:58:32.978 UTC [2486740][client backend][3/6:0] LOG: short read injection point called, is enabled: 0
==2486740== VALGRINDERROR-BEGIN
==2486740== Invalid read of size 2
==2486740== at 0x59C8AC: PageIsNew (bufpage.h:237)
==2486740== by 0x59C8AC: PageIsVerified (bufpage.c:108)
==2486740== by 0x567870: buffer_readv_complete_one (bufmgr.c:6873)
==2486740== by 0x567870: buffer_readv_complete (bufmgr.c:6996)
==2486740== by 0x567870: shared_buffer_readv_complete (bufmgr.c:7153)
==2486740== by 0x55DDB2: pgaio_io_call_complete_shared (aio_callback.c:256)
==2486740== by 0x55D6F1: pgaio_io_process_completion (aio.c:512)
==2486740== by 0x55F53A: pgaio_uring_drain_locked (method_io_uring.c:370)
==2486740== by 0x55F7B8: pgaio_uring_wait_one (method_io_uring.c:449)
==2486740== by 0x55C702: pgaio_io_wait (aio.c:587)
==2486740== by 0x55C8B0: pgaio_wref_wait (aio.c:900)
==2486740== by 0x8639240: read_rel_block_ll (test_aio.c:440)
==2486740== by 0x3B915C: ExecInterpExpr (execExprInterp.c:953)
==2486740== by 0x3B4E4E: ExecInterpExprStillValid (execExprInterp.c:2299)
==2486740== by 0x3F7E97: ExecEvalExprNoReturn (executor.h:445)
==2486740== Address 0x8fa400e is in a rw- anonymous segment
==2486740==
==2486740== VALGRINDERROR-END

The reason for this is that this test unpins the buffer (from the backend's
view), before waiting for the IO. While the AIO subsystem holds a pin,
UnpinBufferNoOwner() marked the buffer as inaccessible:
/*
* Mark buffer non-accessible to Valgrind.
*
* Note that the buffer may have already been marked non-accessible
* within access method code that enforces that buffers are only
* accessed while a buffer lock is held.
*/
VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ);

I think to fix this we need to mark buffers as accessible around the
PageIsVerified() call in buffer_readv_complete_one(), IFF they're not pinned
by the backend. Unfortunately, this is complicated by the fact that local
buffers do not have valgrind integration :(, so we should only do that for
local buffers, as otherwise the local buffer stays inaccessible the next time
it is pinned.

4b)

That's not all though, after getting past this failure, I see uninitialized
memory errors for reads into temporary buffers:

==3334031== VALGRINDERROR-BEGIN
==3334031== Conditional jump or move depends on uninitialised value(s)
==3334031== at 0xD7C859: PageIsVerified (bufpage.c:108)
==3334031== by 0xD381CA: buffer_readv_complete_one (bufmgr.c:6876)
==3334031== by 0xD385D1: buffer_readv_complete (bufmgr.c:7002)
==3334031== by 0xD38D2E: local_buffer_readv_complete (bufmgr.c:7210)
==3334031== by 0xD265FA: pgaio_io_call_complete_local (aio_callback.c:306)
==3334031== by 0xD24720: pgaio_io_reclaim (aio.c:644)
==3334031== by 0xD24400: pgaio_io_process_completion (aio.c:521)
==3334031== by 0xD28D3D: pgaio_uring_drain_locked (method_io_uring.c:382)
==3334031== by 0xD2905F: pgaio_uring_wait_one (method_io_uring.c:461)
==3334031== by 0xD245E0: pgaio_io_wait (aio.c:587)
==3334031== by 0xD24FFE: pgaio_wref_wait (aio.c:900)
==3334031== by 0xD2F471: WaitReadBuffers (bufmgr.c:1695)
==3334031== by 0xD2BCF4: read_stream_next_buffer (read_stream.c:898)
==3334031== by 0x8B4861: heap_fetch_next_buffer (heapam.c:654)
==3334031== by 0x8B4FFA: heapgettup_pagemode (heapam.c:1016)
==3334031== by 0x8B594F: heap_getnextslot (heapam.c:1375)
==3334031== by 0xB28AA4: table_scan_getnextslot (tableam.h:1031)
==3334031== by 0xB29177: SeqNext (nodeSeqscan.c:81)
==3334031== by 0xB28F75: ExecScanFetch (execScan.h:126)
==3334031== by 0xB28FDD: ExecScanExtended (execScan.h:170)

The reason for this one is, I think, that valgrind doesn't understand io_uring
sufficiently. Which isn't surprising, io_uring's nature of an in-memory queue
of commands is somewhat hard to intercept by tools like valgrind and rr.

The best fix for that one would, I think, be to have method_io_uring() iterate
over the IOV and mark the relevant regions as defined? That does fix the
issue at least and does seem to make sense? Not quite sure if we should mark
the entire IOV is efined or just the portion that was actually read - the
latter is additional fiddly code, and it's not clear it's likely to be helpful?

4c)

Unfortunately, once 4a) is addressed, the VALGRIND_MAKE_MEM_NOACCESS() after
PageIsVerified() causes the *next* read into the same buffer in an IO worker
to fail:

==3339904== Syscall param pread64(buf) points to unaddressable byte(s)
==3339904== at 0x5B3B687: __internal_syscall_cancel (cancellation.c:64)
==3339904== by 0x5B3B6AC: __syscall_cancel (cancellation.c:75)
==3339904== by 0x5B93C83: pread (pread64.c:25)
==3339904== by 0xD274F4: pg_preadv (pg_iovec.h:56)
==3339904== by 0xD2799A: pgaio_io_perform_synchronously (aio_io.c:137)
==3339904== by 0xD2A6D7: IoWorkerMain (method_worker.c:538)
==3339904== by 0xC91E26: postmaster_child_launch (launch_backend.c:290)
==3339904== by 0xC99594: StartChildProcess (postmaster.c:3972)
==3339904== by 0xC99EE3: maybe_adjust_io_workers (postmaster.c:4403)
==3339904== by 0xC958A8: PostmasterMain (postmaster.c:1381)
==3339904== by 0xB69622: main (main.c:227)
==3339904== Address 0x7f936d386000 is in a rw- anonymous segment

Because, from the view of the IO worker, that memory is still marked NOACCESS,
even if it since has been marked accessible in the backend.

We could adress this by conditioning the VALGRIND_MAKE_MEM_NOACCESS() on not
being in an IO worker, but it seems better to instead explicitly mark the
region accessible in the worker, before executing the IO.

In a first hack, I did that in pgaio_io_perform_synchronously(), but that is
likely too broad. I don't think the same scenario exists when IOs are
executed synchronously in the foreground.

Questions:

1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't
need to have special-case logic for that. But it also makes the change less
localized and more "impactful", who knows what kind of skullduggery we have
been getting away with unnoticed.

I haven't written the code up yet, but I don't think it'd be all that much
code to add valgrind support to localbuf.

2) Any better ideas to handle the above issues than what I outlined?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2025-04-03 18:23:49 Re: pg_recvlogical cannot create slots with failover=true
Previous Message Tom Lane 2025-04-03 18:16:53 Re: psql suggestion "select <tab>" offers nothing, can we get functions like "\df <tab>"