Re: AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at>
Subject: Re: AIO v2.5
Date: 2025-03-30 00:39:54
Message-ID: r4pkraozkf77o35jwhcgj3ljp7v6eww7la3mj3q6osscftzatp@duypazgcphvl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2025-03-29 14:29:29 -0700, Noah Misch wrote:
> Flushing half-baked review comments before going offline for a few hours:
>
> On Wed, Mar 26, 2025 at 09:07:40PM -0400, Andres Freund wrote:
> > Attached v2.13, with the following changes:
>
> > 5) The WARNING in the callback is now a LOG, as it will be sent to the
> > client as a WARNING explicitly when the IO's results are processed
> >
> > I actually chose LOG_SERVER_ONLY - that seemed slightly better than just
> > LOG? But not at all sure.
>
> LOG_SERVER_ONLY and its synonym COMMERR look to be used for:
>
> - ProcessLogMemoryContextInterrupt()
> - messages before successful authentication
> - protocol sync loss, where we'd fail to send a client message
> - client already gone
>
> The choice between LOG and LOG_SERVER_ONLY doesn't matter much for $SUBJECT.
> If a client has decided to set client_min_messages that high, the client might
> be interested in the fact that it got side-tracked completing someone else's
> IO. On the other hand, almost none of those sidetrack events will produce
> messages. The main argument I'd envision for LOG_SERVER_ONLY is that we
> consider the message content sensitive, but I don't see the message content as
> materially sensitive.

I don't think it's sensitive - it just seems a bit sillier to send the same
thing to the client twice than to the server log. I'm happy to change it to
LOG if you prefer. Your points below mean some comments need to be updated in
smgr/md.c anyway.

> > - Previously the buffer completion callback checked zero_damaged_pages - but
> > that's not right, the GUC hopefully is only set on a per-session basis
>
> Good catch. I've now audited the complete_shared callbacks for other variable
> references and actions not acceptable there. I found nothing beyond what you
> found by v2.14.

I didn't find anything else either.

> > Subject: [PATCH v2.14 02/29] aio: Implement support for reads in smgr/md/fd
>
> > + /*
> > + * Immediately log a message about the IO error, but only to the
> > + * server log. The reason to do so immediately is that the originator
> > + * might not process the query result immediately (because it is busy
> > + * doing another part of query processing) or at all (e.g. if it was
> > + * cancelled or errored out due to another IO also failing). The
> > + * issuer of the IO will emit an ERROR when processing the IO's
>
> s/issuer/definer/ please, to avoid proliferating synonyms. Likewise two other
> places in the patches.

Hm. Will do. Doesn't bother me personally, but happy to change it.

> > +/*
> > + * smgrstartreadv() -- asynchronous version of smgrreadv()
> > + *
> > + * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
> > + * `ioh` all parameters are the same as smgrreadv().
>
> I would add a comment starting with:
>
> Compared to smgrreadv(), more responsibilities fall on layers above smgr.
> Higher layers handle partial reads. smgr will ereport(LOG_SERVER_ONLY) some
> problems, but higher layers are responsible for pgaio_result_report() to
> mirror that news to the user and (for ERROR) abort the (sub)transaction.

Hm - if we document that in all the smgrstart* we'd end up with something like
that in a lot of places - but OTOH, this is the first one so far...

> I say "comment starting with", because I think there's a remaining decision
> about who owns the zeroing currently tied to smgrreadv(). An audit of
> mdreadv() vs. AIO counterparts found this part of mdreadv():
>
> if (nbytes == 0)
> {
> /*
> * We are at or past EOF, or we read a partial block at EOF.
> * Normally this is an error; upper levels should never try to
> * read a nonexistent block. However, if zero_damaged_pages
> * is ON or we are InRecovery, we should instead return zeroes
> * without complaining. This allows, for example, the case of
> * trying to update a block that was later truncated away.
> */
> if (zero_damaged_pages || InRecovery)
> {
>
> I didn't write a test to prove its absence, but I'm not finding such code in
> the AIO path.

Yes, there is no such codepath

A while ago I had started a thread about whether the above codepath is
necessary, as the whole idea of putting a buffer into shared buffers that
doesn't exist on-disk is *extremely* ill conceived, it puts a buffer into
shared buffer that somehow wasn't readable on disk, *without* creating it on
disk. The problem is that an mdnblocks() wouldn't know about that
only-in-memory part of the relation and thus most parts of PG won't consider
that buffer to exist - it'd be just skipped in sequential scans etc, but then
it'd trigger errors when extending the relation ("unexpected data beyond
EOF"), etc.

I had planned to put in an error into mdreadv() at the time, but somehow lost
track of that - I kind of mentally put this issue into the "done" category :(

Based on my research, the InRecovery path is not reachable (most recovery
buffer reads go through XLogReadBufferExtended() which extends at that layer
files, and the exceptions like VM/FSM have explicit code to extend the
relation, c.f. vm_readbuf()). It actually looks to me like it *never* was
reachable, the XLogReadBufferExtended() predecessors, back to the initial
addition of WAL to PG, had that such an extension path, as did vm/fsm.

The zero_damaged_pages path hasn't reliably worked for a long time afaict,
because _mdfd_getseg() doesn't know about it (note we're not passing
EXTENSION_CREATE). So unless the buffer is just after the physical end of the
last segment, you'll just get an error at that point. To my knowledge we
haven't heard related complaints.

It makes some sense to have zero_damaged_pages for actually existing pages
reached from sequential / tid / COPY on the table level - after all that's the
only way you might get data out during data recovery. But those would never
reach this logic, as such scans rely on mdnblocks(). For index -> heap
fetches the option seems mainly dangerous, because that'll just create random
buffers in shared buffers that, as explained above, won't then be reached by
other scans. And index scans during data recovery are not a good idea in the
first place, all that one should do in that situation is to dump out the data.

At the very least we need to add a comment about this though. If we want to
implement it, it'd be easy enough, but I think that logic is so insane that I
think we shouldn't do it unless there is some *VERY* clear evidence that we
need it.

> I wondered if we could just Assert(!InRecovery), but adding that to
> md_readv_complete() failed 001_stream_rep.pl with this stack:

I'd expect that to fail in a lot of paths:
XLogReadBufferExtended() ->
ReadBufferWithoutRelcache() ->
ReadBuffer_common() ->
StartReadBuffer()

> > Subject: [PATCH v2.14 04/29] aio: Add pg_aios view
>
> > + /*
> > + * There is no lock that could prevent the state of the IO to advance
> > + * concurrently - and we don't want to introduce one, as that would
> > + * introduce atomics into a very common path. Instead we
> > + *
> > + * 1) Determine the state + generation of the IO.
> > + *
> > + * 2) Copy the IO to local memory.
> > + *
> > + * 3) Check if state or generation of the IO changed. If the state
> > + * changed, retry, if the generation changed don't display the IO.
> > + */
> > +
> > + /* 1) from above */
> > + start_generation = live_ioh->generation;
> > + pg_read_barrier();
>
> I think "retry:" needs to be here, above start_state assignment. Otherwise,
> the "live_ioh->state != start_state" test will keep seeing a state mismatch.

Damn, you're right.

> > Subject: [PATCH v2.14 05/29] localbuf: Track pincount in BufferDesc as well
> > Subject: [PATCH v2.14 07/29] aio: Add WARNING result status
> > Subject: [PATCH v2.14 08/29] pgstat: Allow checksum errors to be reported in
> > critical sections
> > Subject: [PATCH v2.14 09/29] Add errhint_internal()
>
> Ready for commit

Cool

> > Subject: [PATCH v2.14 10/29] bufmgr: Implement AIO read support
>
> > Buffer reads executed this infrastructure will report invalid page / checksum
> > errors / warnings differently than before:
>
> s/this/through this/

Fixed.

> > + *zeroed_or_error_count = rem_error & ((1 << 7) - 1);
> > + rem_error >>= 7;
>
> These raw "7" are good places to use your new #define values. Likewise in
> buffer_readv_encode_error().

Which define value are you thinking of here? I don't think any of the ones I
added apply? But I think you're right it'd be good to have some define for
it, at least locally.

> > + * that was errored or zerored or, if no errors/zeroes, the first ignored
>
> s/zerored/zeroed/
>
> > + * enough. If there is an error, the error is the integeresting offset,
>
> typo "integeresting"

:(. Fixed.

> > +/*
> > + * We need a backend-local completion callback for shared buffers, to be able
> > + * to report checksum errors correctly. Unfortunately that can only safely
> > + * happen if the reporting backend has previously called
>
> Missing end of sentence.

It's now:

/*
* We need a backend-local completion callback for shared buffers, to be able
* to report checksum errors correctly. Unfortunately that can only safely
* happen if the reporting backend has previously called
* pgstat_prepare_report_checksum_failure(), which we can only guarantee in
* the backend that started the IO. Hence this callback.
*/

> > @@ -144,8 +144,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
> > */
>
> There's an outdated comment ending here:
>
> /*
> * Throw a WARNING if the checksum fails, but only after we've checked for
> * the all-zeroes case.
> */

Updated to:
/*
* Throw a WARNING/LOG, as instructed by PIV_LOG_*, if the checksum fails,
* but only after we've checked for the all-zeroes case.
*/

I found one more, the newly added comment about checksum_failure_p was still
talking about ignore_checksum_failure, but it should now be IGNORE_CHECKSUM_FAILURE.

> > Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control
> > ignore_checksum_failure
> > Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers()
> > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design
> > Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO
> > Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional batchmode
> > support
> > Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as
> > wait time
> > Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems
>
> Ready for commit

Cool.

> > Subject: [PATCH v2.14 18/29] aio: Add test_aio module
>
> I didn't yet re-review the v2.13 or 2.14 changes to this one. That's still in
> my queue.

That's good - I think some of the tests need to expand a bit more. Since
that's at the end of the dependency chain...

> One thing I noticed anyway:

> > +# Tests using injection points. Mostly to exercise had IO errors that are
>
> s/had/hard/

Fixed.

> On Sat, Mar 29, 2025 at 02:25:15PM -0400, Andres Freund wrote:
> > On 2025-03-29 10:48:10 -0400, Andres Freund wrote:
> > > Attached is v2.14:
>
> > > - push pg_aios view (depends a tiny bit on the smgr/md/fd change above)
> >
> > I think I found an issue with this one - as it stands the view was viewable by
> > everyone. While it doesn't provide a *lot* of insight, it still seems a bit
> > too much for an unprivileged user to learn what part of a relation any other
> > user is currently reading.
> >
> > There'd be two different ways to address that:
> > 1) revoke view & function from public, grant to a limited role (presumably
> > pg_read_all_stats)
> > 2) copy pg_stat_activity's approach of using something like
> >
> > #define HAS_PGSTAT_PERMISSIONS(role) (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))
> >
> > on a per-IO basis.
>
> No strong opinion.

Same.

> I'm not really worried about any of this information leaking. Nothing in
> pg_aios comes close to the sensitivity of pg_stat_activity.query.
> pg_stat_activity is mighty cautious, hiding even stuff like wait_event_type
> that I wouldn't worry about. Hence, another valid choice is (3) change
> nothing.

I'd also be on board with that.

> Meanwhile, I see substantially less need to monitor your own IOs than to
> monitor your own pg_stat_activity rows, and even your own IOs potentially
> reveal things happening in other sessions, e.g. evicting buffers that others
> read and you never read. So restrictions wouldn't be too painful, and (1)
> arguably helps privacy more than (2).
>
> I'd likely go with (1) today.

Sounds good to me. It also has the advantage of being much easier to test than
2).

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2025-03-30 00:42:52 Re: Using read stream in autoprewarm
Previous Message Michael Zhilin 2025-03-29 22:57:28 [PATCH] Avoid useless prefetches in case of recent FPI WAL records