Re: AIO v2.5

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at>
Subject: Re: AIO v2.5
Date: 2025-03-28 03:22:23
Message-ID: 20250328032223.34.nmisch@google.com
Lists: pgsql-hackers

On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote:
> I now wrote some tests. And I both regret doing so (because it found problems
> that would have been apparent long ago if the feature had come with *any*
> tests; had I gone the same route, I could have just pushed stuff) and am glad
> I did (because I dislike pushing broken stuff).
>
> I have to admit, I was tempted to just ignore this issue and just not say
> anything about tests for checksum failures anymore.

I don't blame you.

> 3) We can't pgstat_report_checksum_failure() during the completion callback,
> as it *sometimes* allocates memory
>
> Aside from the allocation-in-critical-section asserts, I think this is
> *extremely* unlikely to actually cause a problem in practice. But we don't
> want to rely on that, obviously.

> Addressing 3) is not at all trivial. Here's what I've thought of so far:
>
>
> Approach I)

> Unfortunately that fails because to access the shared memory with the stats
> data we need to do dsa_get_address()

> Approach II)
>
> Don't report the error in the completion callback. The obvious place would be
> to do it where we'll raise the warning/error in the issuing process. The
> big disadvantage is that that could lead to under-counting checksum
> errors:
>
> a) A read stream does 2+ concurrent reads for the same relation, and more than
> one encounters checksum errors. When processing the results for the first
> failed read, we raise an error and thus won't process the results of the
> second+ reads with errors.
>
> b) A read is started asynchronously, but before the backend gets around to
> processing the result of the IO, it errors out during that other work
> (possibly due to a cancellation). Because the backend never looked at the
> results of the IOs, the checksum errors don't get accounted for.
>
> b) doesn't overly bother me, but a) seems problematic.

While neither is great, I could live with both. I guess I'm optimistic that
clusters experiencing checksum failures won't lose enough reports to these
loss sources to make the difference in whether monitoring catches them. In
other words, a cluster will report N failures without these losses and N-K
after these losses. If N is large enough for relevant monitoring to flag the
cluster appropriately, N-K will also be large enough.

> Approach III)
>
> Accumulate checksum errors in two backend local variables (one for database
> specific errors, one for errors on shared relations), which will be flushed by
> the backend that issued IO during the next pgstat_report_start().
>
> Two disadvantages:
>
> - Accumulation of errors will be delayed until the next
> pgstat_report_start(). That seems acceptable; after all, we already do
> that for a lot of other stats.

Yep, acceptable.

> - We need to register a local callback for shared buffer reads, which don't
> need one today. That's a small bit of added overhead. It's a shame to do
> so for counters that approximately never get incremented.

Fair concern. An idea is to let the complete_shared callback change the
callback list associated with the IO, so it could change
PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW. The
latter would differ from the former only in having the extra local callback.
Could that help? I think the only overhead is using more PGAIO_HCB numbers.
We currently reserve 256 (uint8), but one could imagine trying to pack into
fewer bits. That said, this wouldn't paint us into a corner. We could change
the approach later.

pgaio_io_call_complete_local() starts a critical section. Is that a problem
for this approach?
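
To make the accumulate-and-flush idea concrete, here is a minimal sketch of
Approach III in plain C. The names (checksum_failure_note, checksum_failure_flush)
are hypothetical, not the actual PostgreSQL API: the local completion callback
bumps two backend-local counters without allocating (so it is safe even inside
a critical section), and the issuing backend drains them later, e.g. from
pgstat_report_start(), where allocation is allowed:

```c
#include <assert.h>

/* Backend-local pending counts; incremented in the completion path. */
static int pending_checksum_failures_db;     /* this backend's database */
static int pending_checksum_failures_shared; /* shared relations */

/* Hypothetical local completion callback body: must not allocate. */
static void
checksum_failure_note(int is_shared_rel)
{
	if (is_shared_rel)
		pending_checksum_failures_shared++;
	else
		pending_checksum_failures_db++;
}

/*
 * Hypothetical flush, called outside any critical section (e.g. from
 * pgstat_report_start()), where the real reporting function may allocate.
 */
static void
checksum_failure_flush(void (*report) (int is_shared, int count))
{
	if (pending_checksum_failures_db > 0)
	{
		report(0, pending_checksum_failures_db);
		pending_checksum_failures_db = 0;
	}
	if (pending_checksum_failures_shared > 0)
	{
		report(1, pending_checksum_failures_shared);
		pending_checksum_failures_shared = 0;
	}
}
```

The point of the split is that the critical-section side touches only plain
process-local integers; everything that can fail or allocate is deferred to
the flush side.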

> Approach IV):
>
> Embrace piercing abstractions / generic infrastructure and put two atomic
> variables (one for shared relations, one for the backend's database) in some
> backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC) and
> update that in the completion callback. Flush that variable to the shared
> stats in pgstat_report_start() or such.

I could live with that, though I currently feel better about Approach III.
Overall, III seems best long-term, but II may be the right tactical choice.
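
For contrast, a minimal sketch of Approach IV, with C11 atomics standing in
for pg_atomic_uint32 in per-backend shared memory (the struct and function
names are illustrative, not PostgreSQL's): whichever process completes the IO
atomically bumps the issuing backend's counters, and that backend later swaps
them to zero and reports:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Would live in per-backend shared memory (e.g. PgAioBackend or PGPROC). */
typedef struct BackendChecksumCounts
{
	atomic_uint_fast32_t db_failures;		/* backend's database */
	atomic_uint_fast32_t shared_failures;	/* shared relations */
} BackendChecksumCounts;

/* Completion side: lock-free, no allocation, critical-section safe. */
static void
count_checksum_failure(BackendChecksumCounts *c, int is_shared_rel)
{
	atomic_fetch_add(is_shared_rel ? &c->shared_failures : &c->db_failures, 1);
}

/* Owner side, e.g. from pgstat_report_start(): drain and return the count. */
static uint32_t
drain_db_failures(BackendChecksumCounts *c)
{
	return (uint32_t) atomic_exchange(&c->db_failures, 0);
}
```

The atomic exchange makes the drain race-free against a concurrent completer:
increments that land after the exchange simply show up in the next flush.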

> Does anybody have better ideas?

I think no, but here are some ideas I tossed around:

- Like your Approach III, but have the completing process store the count
locally and flush it, instead of the staging process doing so. Would need
more than 2 slots, but we could have a fixed number of slots and just
discard any reports that arrive with all slots full. Reporting checksum
failures in, say, 8 databases in quick succession probably tells the DBA
there's "enough corruption to start worrying". Missing the 9th database
would be okay.
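
A sketch of that fixed-slot variant (layout and names hypothetical; a real
version would need locking or atomics, and a different unused-slot sentinel,
since database OID 0 is taken): the completing process records per-database
counts in a small fixed table and silently drops reports once all slots are
taken:

```c
#include <assert.h>
#include <stdint.h>

#define CHECKSUM_SLOTS 8

typedef struct ChecksumSlot
{
	uint32_t	dboid;			/* 0 = slot unused (sentinel for this sketch) */
	uint32_t	count;
} ChecksumSlot;

static ChecksumSlot slots[CHECKSUM_SLOTS];

/* Returns 1 if recorded, 0 if dropped because all slots were full. */
static int
record_checksum_failure(uint32_t dboid)
{
	for (int i = 0; i < CHECKSUM_SLOTS; i++)
	{
		if (slots[i].dboid == dboid)
		{
			slots[i].count++;
			return 1;
		}
		if (slots[i].dboid == 0)
		{
			slots[i].dboid = dboid;
			slots[i].count = 1;
			return 1;
		}
	}
	return 0;					/* the 9th distinct database is discarded */
}
```

Bounded memory, no allocation in the completion path, and the loss mode is
exactly the benign one described above: only databases beyond the slot limit
go unreported.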

- Pre-warm the memory allocations and DSAs we could possibly need, so we can
report those stats in critical sections, from the completing process. Bad
since there's an entry per database, hence no reasonable limit on how much
memory a process might need to pre-warm. We could even end up completing an
IO for a database that didn't exist on entry to our critical section.

- Skip the checksum pgstats if we're completing in a critical section.
Doesn't work since we _always_ make a critical section to complete I/O.

This email isn't as well-baked as I like, but the alternative was delaying it
24-48h depending on how other duties go over those hours. My v2.13 review is
still in-progress, too.
