Re: Streaming I/O, vectored I/O (WIP)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date: 2023-12-09 00:41:31
Message-ID: CA+hUKGJLW3pwf7yWWCkoLD+kGb_J3B+Q-oa9ufq82_HAsW0QxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 9, 2023 at 7:25 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> > On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> > > Maybe we should bite the bullet and always retry short writes in
> > > FileWriteV(). Is that what you meant by "handling them"?
> > > If the total size is expensive to calculate, how about passing it as an
> > > extra argument? Presumably it is cheap for the callers to calculate at
> > > the same time that they build the iovec array?

> > There is another problem with pushing it down to fd.c, though.
> > Suppose you try to write 8192 bytes, and the kernel says "you wrote
> > 4096 bytes" so your loop goes around again with the second half the
> > data and now the kernel says "-1, ENOSPC". What are you going to do?
> > fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> > so you'd either have to return -1, ENOSPC (converting short writes
> > into actual errors, a lie because you did write some data), or return
> > 4096 (and possibly also set errno = ENOSPC as we have always done).
> > So you can't really handle this problem at this level, can you?
> > Unless you decide that fd.c should get into the business of raising
> > errors for I/O failures, which would be a bit of a departure.
> >
> > That's why I did the retry higher up in md.c.
>
> I think that's the right call. I think for AIO we can't do retry handling
> purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
> buy us that much in md.c anyway, we still need to handle the cross segment
> case and such, from what I can tell?

Heikki, what do you think about this: we could go with the v3 fd.c
and md.c patches, but move adjust_iovec_for_partial_transfer() into
src/common/file_utils.c, so that at least that slightly annoying part
of the job is available for re-use by future code that faces the same
problem?

Note that in file_utils.c we already have pg_pwritev_with_retry(),
which is clearly related to all this: that is a function that
guarantees to either complete the full pwritev() or throw an ERROR,
but leaves it undefined whether any data has been written on ERROR.
It has to add up the size too, and it adjusts the iovec array at the
same time, so it wouldn't use adjust_iovec_for_partial_transfer().
This is essentially the type of interface that I declined to put into
fd.c's FileWrite() and FileRead() because I feel like it doesn't fit
with the existing functions' primary business of adding vfd support to
well known basic I/O functions that return bytes transferred and set
errno. Perhaps someone might later want to introduce File*WithRetry()
wrappers or something if that proves useful? I wouldn't want them for
md.c though because I already know the size.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2023-12-09 02:29:17 Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)
Previous Message Tom Lane 2023-12-09 00:39:20 Re: backtrace_on_internal_error