From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at>
Subject: Re: AIO v2.5
Date: 2025-03-28 12:54:42
Message-ID: wkjj4p2rmkevutkwc6tewoovdqznj6c6nvjmvii4oo5wmbh5sr@retq7d6uqs4j
Lists: pgsql-hackers
Hi,
On 2025-03-27 20:22:23 -0700, Noah Misch wrote:
> On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote:
> > Don't report the error in the completion callback. The obvious place would be
> > to do it where we'll raise the warning/error in the issuing process. The
> > big disadvantage is that that could lead to under-counting checksum
> > errors:
> >
> > a) A read stream does 2+ concurrent reads for the same relation, and more than
> > one encounters checksum errors. When processing the results for the first
> > failed read, we raise an error and thus won't process the results of the
> > second+ reads with errors.
> >
> > b) A read is started asynchronously, but before the backend gets around to
> > processing the result of the IO, it errors out during that other work
> > (possibly due to a cancellation). Because the backend never looked at the
> > results of the IOs, the checksum errors don't get accounted for.
> >
> > b) doesn't overly bother me, but a) seems problematic.
>
> While neither are great, I could live with both. I guess I'm optimistic that
> clusters experiencing checksum failures won't lose enough reports to these
> loss sources to make the difference in whether monitoring catches them. In
> other words, a cluster will report N failures without these losses and N-K
> after these losses. If N is large enough for relevant monitoring to flag the
> cluster appropriately, N-K will also be large enough.
That's true.
> > Approach III)
> >
> > Accumulate checksum errors in two backend local variables (one for database
> > specific errors, one for errors on shared relations), which will be flushed by
> > the backend that issued IO during the next pgstat_report_start().
FWIW, two variables turn out to not quite suffice - as I realized later, we
actually can issue IO on behalf of arbitrary databases, due to
ScanSourceDatabasePgClass() and RelationCopyStorageUsingBuffer().
That unfortunately makes it much harder to guarantee that the completor of an
IO has the DSM segment for a pg_stat_database stats entry mapped.
> > - We need to register a local callback for shared buffer reads, which don't
> > need them today. That's a small bit of added overhead. It's a shame to do
> > so for counters that approximately never get incremented.
>
> Fair concern. An idea is to let the complete_shared callback change the
> callback list associated with the IO, so it could change
> PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW. The
> latter would differ from the former only in having the extra local callback.
> Could that help? I think the only overhead is using more PGAIO_HCB numbers.
I think changing the callback could work - I'll do some measurements in a
coffee or two, but I suspect the overhead is not worth being too worried about
for now. There's a different aspect that worries me slightly more, see
further down.
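To make sure we're talking about the same idea, roughly this (purely a sketch -
the _SLOW callback id and the two helpers marked "made up" don't exist today,
and the real shared-completion callback signature may differ):

/*
 * Hypothetical sketch of the callback-swap idea, not actual AIO API: if the
 * shared completion callback sees a checksum failure, it switches the handle
 * over to a variant callback list that additionally has a local completion
 * callback.
 */
static PgAioResult
shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result,
                             uint8 cb_data)
{
    PgAioResult result = prior_result;

    /* ... verify the pages, record checksum failures in result ... */

    if (result_has_checksum_failures(result))       /* made up */
        pgaio_io_switch_callback(ioh,                /* made up */
                                 PGAIO_HCB_SHARED_BUFFER_READV_SLOW);

    return result;
}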
> We currently reserve 256 (uint8), but one could imagine trying to pack into
> fewer bits.
Yea, my current local worktree reduces it to 6 bits for now, to make space for
keeping track of the number of checksum failures in the error data (as part of
that it adds defines for the bit widths). If that becomes an issue we can make
PgAioResult wider, but I suspect that won't be needed anytime soon.
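For illustration, the kind of packing I mean (hypothetical widths and layout,
not the actual PgAioResult definition):

/*
 * Hypothetical sketch of the packing: the callback id shrinks to 6 bits,
 * freeing room so that the callback-defined error data can carry a small
 * checksum-failure count.
 */
#define PGAIO_RESULT_ID_BITS          6     /* 64 callback ids */
#define PGAIO_RESULT_STATUS_BITS      2
#define PGAIO_RESULT_ERROR_DATA_BITS  24

typedef struct PgAioResult
{
    uint32      id : PGAIO_RESULT_ID_BITS;                  /* PgAioHandleCallbackID */
    uint32      status : PGAIO_RESULT_STATUS_BITS;          /* PgAioResultStatus */
    uint32      error_data : PGAIO_RESULT_ERROR_DATA_BITS;  /* meaning defined by callback */
    int32       result;                                     /* raw result of the IO */
} PgAioResult;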
One simplification that we could make is to only ever report one checksum
failure for each IO, even if N buffers failed - after all, that's what HEAD
does (by virtue of throwing an error after the first). Then we wouldn't need to
track the number of checksum errors at all.
> That said, this wouldn't paint us into a corner. We could change the
> approach later.
Indeed - I think we mainly need something that works for now. I think medium
term the right fix here would be to make sure that the stats can be accounted
for with just an atomic increment somewhere.
We've had several discussions around having an in-memory data structure for
every relation that currently has a buffer in shared_buffers, to store e.g. the
relation length and the sync requests. If we get that - I think Thomas has a
prototype - we can accumulate the number of checksum errors in there, for
example. It'd also allow us to address the biggest blocker for writes, namely
that RememberSyncRequest() could fail *after* IO completion.
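For illustration, something along these lines (purely made-up struct, Thomas'
prototype will look different; the point is just the "atomic increment
somewhere" shape):

/*
 * Hypothetical per-relation shared memory entry: counters that have to be
 * updated from IO completion, possibly inside a critical section, can be
 * plain atomics that get folded into pgstat later.
 */
typedef struct SharedRelationEntry
{
    RelFileLocator      rlocator;           /* which relation this is */
    pg_atomic_uint64    nblocks;            /* cached relation length */
    pg_atomic_uint32    checksum_failures;  /* accumulated, flushed to pgstat */
    /* ... sync request state etc ... */
} SharedRelationEntry;

static inline void
count_checksum_failure(SharedRelationEntry *entry)
{
    /* safe in a critical section: no allocation, no locks */
    pg_atomic_fetch_add_u32(&entry->checksum_failures, 1);
}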
> pgaio_io_call_complete_local() starts a critical section. Is that a problem
> for this approach?
I think we can make it not a problem - I added a
pgstat_prepare_report_checksum_failure(dboid) that ensures the calling backend
has a reference to the relevant shared memory stats entry. If we make the rule
that it has to be called *before* starting buffered IO (i.e. in
AsyncReadBuffers()), we can be sure the stats reference still exists by the
time local completion runs (as there isn't a way to have the stats entry dropped
without dropping the database, which isn't possible while a) the database is
still connected to, for normal IO, or b) the CREATE DATABASE is still running).
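Spelled out, the intended ordering is roughly this (sketch only -
pgstat_prepare_report_checksum_failure() is the helper from my worktree,
everything else here is a placeholder name):

static void
start_buffered_read(Oid dboid)
{
    /*
     * Issuer side, think AsyncReadBuffers(): outside any critical section,
     * make sure we hold a reference to the pg_stat_database entry.
     */
    pgstat_prepare_report_checksum_failure(dboid);

    /* ... stage and submit the IO ... */
}

static void
local_read_complete(Oid dboid, int nfailures)
{
    /*
     * Local completion callback, possibly inside the critical section
     * started by pgaio_io_call_complete_local().  The entry referenced
     * above is guaranteed to still exist, so reporting only has to bump
     * pending counters - no allocation, no hashtable insert.
     */
    if (nfailures > 0)
        report_checksum_failures_prepared(dboid, nfailures);    /* placeholder */
}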
Unfortunately pgstat_prepare_report_checksum_failure() has to do a lookup in a
local hashtable. That's more expensive than an indirect function call
(i.e. the added local callback). I hope^Wsuspect it'll still be fine, and if
not we can apply a mini-cache for the current database, which is surely the
only thing that ever matters for performance.
> > Approach IV):
> >
> > Embrace piercing abstractions / generic infrastructure and put two atomic
> > variables (one for shared relations, one for the backend's database) in some
> > backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC) and
> > update that in the completion callback. Flush that variable to the shared
> > stats in pgstat_report_start() or such.
>
> I could live with that. I feel better about Approach III currently, though.
> Overall, I'm feeling best about III long-term, but II may be the right
> tactical choice.
I think it's easy to change between these approaches. Both require that we
encode the number of checksum failures in the result, which is where most of
the complexity lies (but still a rather surmountable amount of complexity).
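For comparison, Approach IV would boil down to something like the following
(hypothetical per-backend fields; the atomics calls are the existing
pg_atomic_* API):

/*
 * Hypothetical sketch of Approach IV: two atomics hanging off per-backend
 * shared memory (PgAioBackend, PGPROC, ...), bumped by whichever process
 * completes the IO, flushed by the owning backend.
 */
typedef struct BackendChecksumFailures
{
    pg_atomic_uint32    in_my_database;     /* the backend's current database */
    pg_atomic_uint32    in_shared_rels;     /* shared relations */
} BackendChecksumFailures;

static void
count_failures_for_owner(BackendChecksumFailures *owner, uint32 n, bool shared)
{
    /* completion side, any process, fine inside a critical section */
    pg_atomic_fetch_add_u32(shared ? &owner->in_shared_rels : &owner->in_my_database,
                            n);
}

static void
flush_my_failures(BackendChecksumFailures *mine)
{
    /* owning backend, e.g. from pgstat_report_start() */
    uint32      n = pg_atomic_exchange_u32(&mine->in_my_database, 0);

    if (n > 0)
        pgstat_report_checksum_failures_in_db(MyDatabaseId, n);
    /* same for in_shared_rels, reported against the shared "database" */
}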
> I think no, but here are some ideas I tossed around:
>
> - Like your Approach III, but have the completing process store the count
> locally and flush it, instead of the staging process doing so. Would need
> more than 2 slots, but we could have a fixed number of slots and just
> discard any reports that arrive with all slots full. Reporting checksum
> failures in, say, 8 databases in quick succession probably tells the DBA
> there's "enough corruption to start worrying". Missing the 9th database
> would be okay.
Yea. I think that'd be an ok fallback, but if we can make III' work, it'd be
nicer.
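For the record, the slot fallback could be about as simple as this
(hypothetical, backend-local in the completing process; failures on shared
relations would just go into a separate counter):

/*
 * Hypothetical sketch of the fixed-slot fallback: a handful of per-database
 * pending counts, flushed at pgstat_report_start(); reports arriving while
 * all slots are taken for other databases are simply dropped.
 */
#define PENDING_CHECKSUM_FAILURE_SLOTS 8

typedef struct PendingChecksumFailures
{
    Oid         dboid;          /* InvalidOid == slot unused */
    uint32      count;
} PendingChecksumFailures;

static PendingChecksumFailures pending_failures[PENDING_CHECKSUM_FAILURE_SLOTS];
static uint32 pending_shared_failures;  /* shared relations, tracked separately */

static void
remember_checksum_failures(Oid dboid, uint32 count)
{
    for (int i = 0; i < PENDING_CHECKSUM_FAILURE_SLOTS; i++)
    {
        if (pending_failures[i].dboid == dboid ||
            pending_failures[i].dboid == InvalidOid)
        {
            pending_failures[i].dboid = dboid;
            pending_failures[i].count += count;
            return;
        }
    }
    /* all slots used by other databases: accept losing this report */
}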
> - Pre-warm the memory allocations and DSAs we could possibly need, so we can
> report those stats in critical sections, from the completing process. Bad
> since there's an entry per database, hence no reasonable limit on how much
> memory a process might need to pre-warm. We could even end up completing an
> IO for a database that didn't exist on entry to our critical section.
I experimented with this one - it works surprisingly well, because for IO
workers we could just do the pre-warming outside of the critical section, and
it's *exceedingly* rare that any other completor would ever need to complete
IO for a database other than the current one / a shared relation.
But it does leave a nasty edge case that we'd just have to accept. I guess we
could make it so that stats simply aren't reported in that case. But it seems
pretty ugly.
> This email isn't as well-baked as I like, but the alternative was delaying it
> 24-48h depending on how other duties go over those hours. My v2.13 review is
> still in-progress, too.
It's appreciated!
Greetings,
Andres Freund