From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at>
Subject: Re: AIO v2.5
Date: 2025-03-27 20:58:11
Message-ID: 5tyic6epvdlmd6eddgelv47syg2b5cpwffjam54axp25xyq2ga@ptwkinxqo3az
Lists: pgsql-hackers
Hi,
On 2025-03-26 21:07:40 -0400, Andres Freund wrote:
> TODO
> ...
> - Add an explicit test for the checksum verification in the completion callback
>
> There is an existing test for testing an invalid page due to page header
> verification in test_aio, but not for checksum failures.
>
> I think it's indirectly covered (e.g. in amcheck), but seems better to test
> it explicitly.
Ah, for crying out loud. As it turns out, no, we do not have *ANY* server-side
tests for this. Not a single one. I'm somewhat apoplectic: data_checksums is a
really complicated feature, which we just started *turning on by default*,
without a single test of the failure behaviour, even though detecting failures
is the one thing the feature is supposed to do.
I now wrote some tests. I both regret doing so (because they found problems
that would have been apparent long ago if the feature had come with *any*
tests; had I gone that route, I could have just pushed the patches) and am
glad I did (because I dislike pushing broken stuff).
I have to admit, I was tempted to just ignore this issue and not say anything
about tests for checksum failures anymore.
Problems:
1) PageIsVerifiedExtended() emits a WARNING. Just like with ZERO_ON_ERROR, we
   don't want to emit it in a) io workers or b) another backend, if that
   backend completes the IO.

   This isn't hard to address: we can add PIV_LOG_LOG (or something like
   that) to emit the message at a different log level, and an out-parameter
   to trigger sending a warning / adjusting the warning/error message we
   already emit, once the issuer completes the IO.
2) With IO workers (and "foreign completors", in rare cases), the checksum
   failures would be attributed wrongly, as PageIsVerifiedExtended() reports
   all stats to MyDatabaseId.

   As it turns out, this is already borked on master for shared relations,
   and has been ever since pg_stat_database.checksum_failures was introduced,
   see [1].

   This isn't too hard to fix: adjust the signature of
   PageIsVerifiedExtended() to pass in the database oid. But see also 3).
3) We can't pgstat_report_checksum_failure() during the completion callback,
   as it *sometimes* allocates memory.

   Aside from the allocation-in-critical-section asserts, I think this is
   *extremely* unlikely to actually cause a problem in practice. But we don't
   want to rely on that, obviously.
Addressing the first two is pretty simple and/or needs to be done anyway,
since it's a currently existing bug, as discussed in [1]. A rough sketch of
the adjusted interface is below.
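
To make 1) and 2) concrete, here's a rough sketch of what the adjusted
interface could look like - PIV_LOG_LOG, the dboid argument and the
out-parameter are just illustrations, not settled names or bit values:

/* bufpage.h, sketch only */
#define PIV_LOG_WARNING  (1 << 0)   /* existing: emit message at WARNING */
#define PIV_LOG_LOG      (1 << 1)   /* new: emit at LOG instead, for io
                                     * workers / foreign completors */

/*
 * New dboid argument: attribute checksum-failure stats to the right
 * database instead of blindly to MyDatabaseId.  New out-parameter: tells
 * the issuer that a checksum failure was detected, so it can emit the real
 * WARNING/ERROR when it completes the IO.
 */
extern bool PageIsVerifiedExtended(Page page, Oid dboid, BlockNumber blkno,
                                   int flags, bool *checksum_failure_p);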
Addressing 3) is not at all trivial. Here's what I've thought of so far:
Approach I)
My first thoughts were around trying to make the relevant pgstat
infrastructure either not need to allocate memory, or handle memory
allocation failures gracefully.
Unfortunately that doesn't really seem viable:
The most successful approach I tried was to report stats directly to the
dshash table, and only report stats if there's already an entry (which there
just about always will be, except for a very short period after stats have
been reset); a sketch is below.
Unfortunately that fails because to access the shared memory with the stats
data we need to do dsa_get_address(), which can fail if the relevant dsm
segment wasn't already mapped in the current process (it allocates memory
in the process of mapping in the segment). There's no API to do that
without erroring out.
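
For reference, roughly the shape of that attempt (simplified; uses internal
pgstat names from pgstat_internal.h, per-entry locking elided):

PgStat_HashKey key = {.kind = PGSTAT_KIND_DATABASE, .dboid = dboid};
PgStatShared_HashEntry *entry;

/* lookup only, never create - creating would allocate */
entry = dshash_find(pgStatLocal.shared_hash, &key, false);
if (entry)
{
    PgStatShared_Database *shdbent;

    /*
     * This is where it falls over: dsa_get_address() may have to map in
     * the DSM segment containing the entry, which allocates memory.
     */
    shdbent = dsa_get_address(pgStatLocal.dsa, entry->body);
    shdbent->stats.checksum_failures++;
    dshash_release_lock(pgStatLocal.shared_hash, entry);
}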
That aspect rules out a number of other approaches that sounded like they
could work - we e.g. could increase the refcount of the relevant pgstat
entry before issuing IO, ensuring that it's around by the time we need to
report. But that wouldn't get around the issue of needing to map in the dsm
segment.
Approach II)

Don't report the error in the completion callback. The obvious place would be
to do it where we'll raise the warning/error in the issuing process (sketched
below). The big disadvantage is that that could lead to under-counting
checksum errors:
a) A read stream does 2+ concurrent reads for the same relation, and more than
one encounters checksum errors. When processing the results for the first
failed read, we raise an error and thus won't process the results of the
second+ reads with errors.
b) A read is started asynchronously, but before the backend gets around to
processing the result of the IO, it errors out during that other work
(possibly due to a cancellation). Because the backend never looked at the
results of the IOs, the checksum errors don't get accounted for.
b) doesn't overly bother me, but a) seems problematic.
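
For II), the reporting would move to something like this, running in the
issuing backend once it processes the IO's result (hypothetical names,
loosely following the shape of the AIO completion callbacks):

/* called by the issuer when processing the result, not by the completer */
static void
buffer_readv_report(uint32 checksum_failures, Oid dboid, int elevel)
{
    if (checksum_failures > 0)
        pgstat_report_checksum_failures_in_db(dboid, checksum_failures);

    ereport(elevel,
            errcode(ERRCODE_DATA_CORRUPTED),
            errmsg("%u checksum failure(s) detected", checksum_failures));

    /*
     * If elevel >= ERROR we don't return - the results of other concurrent
     * IOs with failures are never processed, hence scenario a) above.
     */
}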
Approach III)

Accumulate checksum errors in two backend-local variables (one for
database-specific errors, one for errors on shared relations), which will be
flushed by the backend that issued the IO during the next
pgstat_report_stat(). A sketch follows the list below.
Two disadvantages:

- Accumulation of errors will be delayed until the next pgstat_report_stat().
  That seems acceptable; after all, we already do that for a lot of other
  stats.

- We need to register a local callback for shared buffer reads, which don't
  need one today. That's a small bit of added overhead. It's a shame to do so
  for counters that approximately never get incremented.
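
A sketch of III) - all names here are invented:

/* backend-local pending counters, flushed later by the issuing backend */
static int pending_checksum_failures_db;     /* MyDatabaseId */
static int pending_checksum_failures_shared; /* shared relations */

/* local completion callback: just accumulate, no allocations */
static void
buffer_readv_local_complete(Oid dboid, int nfailures)
{
    if (dboid == InvalidOid)    /* shared relation */
        pending_checksum_failures_shared += nfailures;
    else
        pending_checksum_failures_db += nfailures;
}

/* called from pgstat_report_stat(), outside any critical section */
static void
flush_pending_checksum_failures(void)
{
    if (pending_checksum_failures_db > 0)
        pgstat_report_checksum_failures_in_db(MyDatabaseId,
                                              pending_checksum_failures_db);
    /* assumes the shared-relations entry is keyed by InvalidOid */
    if (pending_checksum_failures_shared > 0)
        pgstat_report_checksum_failures_in_db(InvalidOid,
                                              pending_checksum_failures_shared);
    pending_checksum_failures_db = pending_checksum_failures_shared = 0;
}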
Approach IV):

Embrace piercing abstractions / generic infrastructure and put two atomic
variables (one for shared relations, one for the backend's database) in some
backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC)
and update them in the completion callback. Flush those variables to the
shared stats in pgstat_report_stat() or such.
This would avoid the need for the local completion callback, and would also
allow introducing a function to see the number of "unflushed" checksum
errors. It also doesn't require transporting the number of errors between the
shared callback and the local callback - but we might want to have that for
the error message anyway. A sketch is below.
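
Sketch of IV) - field and variable names invented, using PgAioBackend as the
example home:

/* in the backend's AIO state in shared memory */
typedef struct PgAioBackend
{
    /* ... existing fields ... */
    pg_atomic_uint64 checksum_failures_db;      /* backend's database */
    pg_atomic_uint64 checksum_failures_shared;  /* shared relations */
} PgAioBackend;

/* shared completion callback, possibly running in an io worker: */
pg_atomic_fetch_add_u64(&issuer_aio_backend->checksum_failures_db, nfailures);

/* in pgstat_report_stat() of the issuing backend: */
uint64 n = pg_atomic_exchange_u64(&my_aio_backend->checksum_failures_db, 0);

if (n > 0)
    pgstat_report_checksum_failures_in_db(MyDatabaseId, (int) n);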
I wish the new-to-18 pgstat_backend() were designed in a way to make this
possible in a nice way. But unfortunately it puts the backend-specific data in
the dshash table / dynamic shared memory, rather than in a MaxBackends +
NUM_AUX sized array in plain shared memory. As explained in I), we can't
rely on having the entire array mapped. Leaving the issue from this email
aside, that also adds a fair bit of overhead to other cases.
Does anybody have better ideas?
I think II), III) and IV) are all relatively simple to implement.
The most complicated part is that a bit of bit-squeezing is necessary to fit
the number of checksum errors (in addition to the number of otherwise invalid
pages) into the available space for error data. It's doable (see the sketch
below). We could also just increase the size of PgAioResult.
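
Something like this would do for the squeezing (the widths are illustrative,
and counts get capped rather than overflowed):

#define CHECKSUM_FAIL_BITS  12
#define CHECKSUM_FAIL_MAX   ((1u << CHECKSUM_FAIL_BITS) - 1)

static inline uint32
encode_read_error_data(uint32 checksum_failures, uint32 invalid_pages)
{
    checksum_failures = Min(checksum_failures, CHECKSUM_FAIL_MAX);
    invalid_pages = Min(invalid_pages, CHECKSUM_FAIL_MAX);

    return checksum_failures | (invalid_pages << CHECKSUM_FAIL_BITS);
}

static inline void
decode_read_error_data(uint32 error_data,
                       uint32 *checksum_failures, uint32 *invalid_pages)
{
    *checksum_failures = error_data & CHECKSUM_FAIL_MAX;
    *invalid_pages = (error_data >> CHECKSUM_FAIL_BITS) & CHECKSUM_FAIL_MAX;
}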
I've implemented II), but I'm not sure the disadvantages are acceptable.
Greetings,
Andres Freund
[1] https://postgr.es/m/mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg%40e4gt7npsohuz