From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-25 14:11:20
Message-ID: 20250325141120.8e.nmisch@google.com
Lists: pgsql-hackers

On Mon, Mar 24, 2025 at 10:52:19PM -0400, Andres Freund wrote:
> On 2025-03-24 19:20:37 -0700, Noah Misch wrote:
> > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > > +static pg_attribute_always_inline PgAioResult
> > > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > > +						  bool failed, bool is_temp)
> > > +{
> > ...
> > > +	if ((flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
> > > +	{
> > > +		ereport(WARNING,
> > > +				(errcode(ERRCODE_DATA_CORRUPTED),
> > > +				 errmsg("invalid page in block %u of relation %s; zeroing out page",
> >
> > My earlier review requested s/LOG/WARNING/, but I wasn't thinking about this
> > in full depth. In the !is_temp case, this runs in a complete_shared
> > callback. A process unrelated to the original IO may run this callback.
> > That's unfortunate in two ways. First, that other process's client gets an
> > unexpected WARNING. The process getting the WARNING may not even have
> > zero_damaged_pages enabled. Second, the client of the process that staged
> > the IO gets no message.
>
> Ah, right. That could be why I had flipped it. If so, shame on me for not
> adding a comment...
>
>
> > AIO ERROR-level messages handle this optimally. We emit a LOG-level message
> > in the process that runs the complete_shared callback, and we arrange for the
> > ERROR-level message in the stager. That would be ideal here: LOG in the
> > complete_shared runner, WARNING in the stager.
>
> We could obviously downgrade (crossgrade? A LOG is more severe than a WARNING
> in some ways, but not others) the message when run in a different backend fairly
> easily. Still emitting a WARNING in the stager however is a bit more tricky.
>
> Before thinking more deeply about how we could emit WARNING in the stager:
>
> Is it actually sane to use WARNING here? At least for ZERO_ON_ERROR that could
> trigger a rather massive flood of messages to the client in a *normal*
> situation. I'm thinking of something like an insert extending a relation some
> time after an immediate restart and encountering a lot of FSM corruption (due
> to its non-crash-safe-ness) during the search for free space and the
> subsequent FSM vacuum. It might be ok to LOG that, but sending a lot of
> WARNINGs to the client seems not quite right.

Orthogonal to AIO, I do think LOG (or even DEBUG1?) is better for
ZERO_ON_ERROR. The ZERO_ON_ERROR case also should not use
ERRCODE_DATA_CORRUPTED. (That errcode shouldn't appear for business as usual.
It should signify wrong or irretrievable query results, essentially.)

For zero_damaged_pages, WARNING seems at least defensible, and
ERRCODE_DATA_CORRUPTED is right. It wouldn't be the worst thing to change
zero_damaged_pages to LOG and let the complete_shared runner log it, as long
as we release-note that. It's superuser-only, and the superuser can learn to
check the log. One typically should use zero_damaged_pages in one session at
a time, so the logs won't be too confusing.

Another thought on complete_shared running on other backends: I wonder if we
should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
other process" or similar, so people wonder less about how "SELECT FROM a" led
to a log message about IO on table "b".

> If we want to implement it, I think we could introduce PGAIO_RS_WARN, which
> then could tell the stager to issue the WARNING. It would add a bit of
> distributed cost, both to callbacks and users of AIO, but it might not be too
> bad.
>
>
> > One could simplify things by forcing io_method=sync under ZERO_ON_ERROR ||
> > zero_damaged_pages, perhaps as a short-term approach.
>
> Yea, that could work. Perhaps even just for zero_damaged_pages, after
> changing it so that ZERO_ON_ERROR always just LOGs.

Yes.

> Hm, it seems somewhat nasty to have rather different performance
> characteristics when forced to use zero_damaged_pages to recover from a
> problem. Imagine an instance that's configured to use DIO and then needs to
> use zero_damaged_pages to recover from corruption...

True. I'd be willing to bet high-scale use of zero_damaged_pages is rare. By
high scale, I mean something like reading a whole large table, as opposed to a
TID scan of the known-problematic range. That said, people (including me)
expect the emergency tools to be good even if they're used rarely. You're not
wrong to worry about it.
