From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> |
Subject: | Re: AIO v2.5 |
Date: | 2025-03-25 19:39:56 |
Message-ID: | 20250325193956.57.nmisch@google.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> On 2025-03-25 08:58:08 -0700, Noah Misch wrote:
> > While having nagging thoughts that we might be releasing FDs before io_uring
> > gets them into kernel custody, I tried this hack to maximize FD turnover:
> >
> > static void
> > ReleaseLruFiles(void)
> > {
> > #if 0
> > while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
> > {
> > if (!ReleaseLruFile())
> > break;
> > }
> > #else
> > while (ReleaseLruFile())
> > ;
> > #endif
> > }
> >
> > "make check" with default settings (io_method=worker) passes, but
> > io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two
> > runs. s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
> > Here's the richer of the two diffs:
>
> Yikes. That's a very good catch.
>
> I spent a bit of time debugging this. I think I see what's going on - it turns
> out that the kernel does *not* open the FDs during io_uring_enter() if
> IOSQE_ASYNC is specified [1]. Which we do add heuristically, in an attempt to
> avoid a small but measurable slowdown for sequential scans that are fully
> buffered (c.f. pgaio_uring_submit()). If I disable that heuristic, your patch
> above passes all tests here.
Same result here. As an additional data point, I tried adding this so every
reopen gets a new FD number (leaks FDs wildly):
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1304,5 +1304,5 @@ LruDelete(File file)
* to leak the FD than to mess up our internal state.
*/
- if (close(vfdP->fd) != 0)
+ if (dup2(2, vfdP->fd) != vfdP->fd)
elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
"could not close file \"%s\": %m", vfdP->fileName);
The same "make check" w/ TEMP_CONFIG io_method=io_uring passes with the
combination of that and the max-turnover change to ReleaseLruFiles().
> I don't know if that's an intentional or unintentional behavioral difference.
>
> There are 2 1/2 ways around this:
>
> 1) Stop using IOSQE_ASYNC heuristic
> 2a) Wait for all in-flight IOs when any FD gets closed
> 2b) Wait for all in-flight IOs using FD when it gets closed
>
> Given that we have clear evidence that io_uring doesn't completely support
> closing FDs while IOs are in flight, be it a bug or intentional, it seems
> clearly better to go for 2a or 2b.
Agreed. If a workload spends significant time on fd.c closing files, I
suspect that workload already won't have impressive benchmark numbers.
Performance-seeking workloads will already want to tune FD usage high enough
to keep FDs long-lived. So (1) clearly loses, and neither (2a) nor (2b)
clearly beats the other. I'd try (2b) first but, if complicated, quickly
abandon it in favor of (2a). What other considerations could be important?
From | Date | Subject | |
---|---|---|---|
Next Message | Corey Huinker | 2025-03-25 19:59:13 | Re: Statistics Import and Export |
Previous Message | Sami Imseih | 2025-03-25 19:33:26 | Re: Squash constant lists in query jumbling by default |