From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Noah Misch <noah(at)leadboat(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at> |
Subject: | Re: AIO v2.5 |
Date: | 2025-03-29 03:35:23 |
Message-ID: | ym3dqpa4xcvoeknewcw63x77vnqdosbqcetjinb2zfoh65k55m@m4ozmwhr6lk6 |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2025-03-28 08:54:42 -0400, Andres Freund wrote:
> On 2025-03-27 20:22:23 -0700, Noah Misch wrote:
> > On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote:
> > > - We need to register a local callback for shared buffer reads, which don't
> > > need them today . That's a small bit of added overhead. It's a shame to do
> > > so for counters that approximately never get incremented.
> >
> > Fair concern. An idea is to let the complete_shared callback change the
> > callback list associated with the IO, so it could change
> > PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW. The
> > latter would differ from the former only in having the extra local callback.
> > Could that help? I think the only overhead is using more PGAIO_HCB numbers.
>
> I think changing the callback could work - I'll do some measurements in a
> coffee or two, but I suspect the overhead is not worth being too worried about
> for now. There's a different aspect that worries me slightly more, see
> further down.
> ...
> Unfortunately pgstat_prepare_report_checksum_failure() has to do a lookup in a
> local hashtable. That's more expensive than an indirect function call
> (i.e. the added local callback). I hope^Wsuspect it'll still be fine, and if
> not we can apply a mini-cache for the current database, which is surely the
> only thing that ever matters for performance.
I tried it and at ~30GB/s of read IO, with checksums disabled, I can't see a
difference of either having the unnecessary complete_local callback or having
the lookup in pgstat_prepare_report_checksum_failure(). In a profile there are
a few hits inside pgstat_get_entry_ref(), but not enough to matter.
Hence I think this isn't worth worrying about, at least for now. I think we
have far bigger fish to fry at this point than such a small performance
difference.
I've adjusted the comment above TRACE_POSTGRESQL_BUFFER_READ_DONE() to not
mention the overhead. I'm still inclined to think that it's better to call it
in the shared completion callback.
I also fixed support and added tests for ignore_checksum_failure, that also
needs to be determined at the start of the IO, not in the completion. Once
more there were no tests, of course.
I spent the last 6 hours on the stupid error/warning messages around this,
somewhat ridiculous.
The number of combinations is annoyingly large. It's e.g. plausible to use
ignore_checksum_failure=on and zero_damaged_pages=on at the same time for
recovery. The same buffer could both be ignored *and* zeroed. Or somebody
could use ignore_checksum_failure=on but then still encounter a page that is
invalid.
But I finally got to a point where the code ends up readable, without undue
duplication. It would, leaving some nasty hack aside, require a
errhint_internal() - but I can't imagine a reason against introducing that,
given we have it for the errmsg and errhint.
Here's the relevant code:
/*
* Treat a read that had both zeroed buffers *and* ignored checksums as a
* special case, it's too irregular to be emitted the same way as the other
* cases.
*/
if (zeroed_any && ignored_any)
{
Assert(zeroed_any && ignored_any);
Assert(nblocks > 1); /* same block can't be both zeroed and ignored */
Assert(result.status != PGAIO_RS_ERROR);
affected_count = zeroed_or_error_count;
ereport(elevel,
errcode(ERRCODE_DATA_CORRUPTED),
errmsg("zeroing %u pages and ignoring %u checksum failures among blocks %u..%u of relation %s",
affected_count, checkfail_count, first, last, rpath.str),
affected_count > 1 ?
errdetail("Block %u held first zeroed page.",
first + first_off) : 0,
errhint("See server log for details about the other %u invalid blocks.",
affected_count + checkfail_count - 1));
return;
}
/*
* The other messages are highly repetitive. To avoid duplicating a long
* and complicated ereport(), gather the translated format strings
* separately and then do one common ereport.
*/
if (result.status == PGAIO_RS_ERROR)
{
Assert(!zeroed_any); /* can't have invalid pages when zeroing them */
affected_count = zeroed_or_error_count;
msg_one = _("invalid page in block %u of relation %s");
msg_mult = _("%u invalid pages among blocks %u..%u of relation %s");
det_mult = _("Block %u held first invalid page.");
hint_mult = _("See server log for the other %u invalid blocks.");
}
else if (zeroed_any && !ignored_any)
{
affected_count = zeroed_or_error_count;
msg_one = _("invalid page in block %u of relation %s; zeroing out page");
msg_mult = _("zeroing out %u invalid pages among blocks %u..%u of relation %s");
det_mult = _("Block %u held first zeroed page.");
hint_mult = _("See server log for the other %u zeroed blocks.");
}
else if (!zeroed_any && ignored_any)
{
affected_count = checkfail_count;
msg_one = _("ignoring checksum failure in block %u of relation %s");
msg_mult = _("ignoring %u checksum failures among blocks %u..%u of relation %s");
det_mult = _("Block %u held first ignored page.");
hint_mult = _("See server log for the other %u ignored blocks.");
}
else
pg_unreachable();
ereport(elevel,
errcode(ERRCODE_DATA_CORRUPTED),
affected_count == 1 ?
errmsg_internal(msg_one, first + first_off, rpath.str) :
errmsg_internal(msg_mult, affected_count, first, last, rpath.str),
affected_count > 1 ? errdetail_internal(det_mult, first + first_off) : 0,
affected_count > 1 ? errhint_internal(hint_mult, affected_count - 1) : 0);
Does that approach make sense?
What do you think about using
"zeroing invalid page in block %u of relation %s"
instead of
"invalid page in block %u of relation %s; zeroing out page"
I thought about instead translating "ignoring", "ignored", "zeroing",
"zeroed", etc separately, but I have doubts about how well that would actually
translate.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Alexey Makhmutov | 2025-03-29 03:42:43 | High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues |
Previous Message | Corey Huinker | 2025-03-29 01:11:05 | Re: Statistics Import and Export |