From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Subject: | Re: Streaming I/O, vectored I/O (WIP) |
Date: | 2023-12-08 18:25:54 |
Message-ID: | 20231208182554.swereaxcgvq72ts6@awork3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> > On 29/11/2023 21:39, Thomas Munro wrote:
> > > One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
> > > callers believe that short writes set errno: they error out with a
> > > message including %m. We have historically set errno = ENOSPC inside
> > > FileWrite() if the write size was unexpectedly small AND the kernel
> > > didn't set errno to a non-zero value (having set it to zero ourselves
> > > earlier). In FileWriteV(), I didn't want to do that because it is
> > > expensive to compute the total write size from the vector array and we
> > > managed to measure an effect due to that in some workloads.
> > >
> > > Note that the smgr patch actually handles short writes by continuing,
> > > instead of raising an error. Short writes do already occur in the
> > > wild on various systems for various rare technical reasons other than
> > > ENOSPC I have heard (imagine transient failure to acquire some
> > > temporary memory that the kernel chooses not to wait for, stuff like
> > > that, though certainly many people and programs believe they should
> > > not happen[1]), and it seems like a good idea to actually handle them
> > > as our write sizes increase and the probability of short writes might
> > > presumably increase.
> >
> > Maybe we should bite the bullet and always retry short writes in
> > FileWriteV(). Is that what you meant by "handling them"?
> > If the total size is expensive to calculate, how about passing it as an
> > extra argument? Presumably it is cheap for the callers to calculate at
> > the same time that they build the iovec array?
>
> It's cheap for md.c, because it already has nblocks_this_segment.
> That's one reason I chose to put the retry there. If we push it down
> to fd.c in order to be able to help other callers, you're right that
> we could pass in the total size (and I guess assert that it's
> correct), but that is sort of annoyingly redundant and further from
> the interface we're wrapping.
> There is another problem with pushing it down to fd.c, though.
> Suppose you try to write 8192 bytes, and the kernel says "you wrote
> 4096 bytes" so your loop goes around again with the second half the
> data and now the kernel says "-1, ENOSPC". What are you going to do?
> fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> so you'd either have to return -1, ENOSPC (converting short writes
> into actual errors, a lie because you did write some data), or return
> 4096 (and possibly also set errno = ENOSPC as we have always done).
> So you can't really handle this problem at this level, can you?
> Unless you decide that fd.c should get into the business of raising
> errors for I/O failures, which would be a bit of a departure.
>
> That's why I did the retry higher up in md.c.
I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Korotkov | 2023-12-08 18:30:18 | Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans) |
Previous Message | Tom Lane | 2023-12-08 18:23:50 | Re: backtrace_on_internal_error |