AIO writes vs hint bits vs checksums

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: AIO writes vs hint bits vs checksums
Date: 2024-09-24 15:55:08
Message-ID: stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Currently we modify pages while just holding a share lock, for hint bit
writes. Writing a buffer out only requires a share lock. Because of that we
can't compute checksums and write out pages in-place, as a concurent hint bit
write can easily corrupt the checksum.

That's not great, but not awful for our current synchronous buffered IO, as
we only ever have a single page being written out at a time.

However, it becomes a problem even if we just want to write out in chunks
larger than a single page - we'd need to reserve not just one BLCKSZ sized
buffer for this, but make it PG_IOV_MAX * BLCKSZ sized. Perhaps still
tolerable.

With AIO this becomes a considerably bigger issue:
a) We can't just have one write in progress at a time, but many
b) To be able to implement AIO using workers the "source" or "target" memory
of the IO needs to be in shared memory

Besides that, the need to copy the buffers makes checkpoints with AIO
noticeably slower when checksums are enabled - it's not the checksum but the
copy that's the biggest source of the slowdown.

So far the AIO patchset has solved this by introducing a set of "bounce
buffers", which can be acquired and used as the source/target of IO when doing
it in-place into shared buffers isn't viable.

I am worried about that solution however, as either acquisition of bounce
buffers becomes a performance issue (that's how I did it at first, it was hard
to avoid regressions) or we reserve bounce buffers for each backend, in which
case the memory overhead for instances with relatively small amount of
shared_buffers and/or many connections can be significant.

Which lead me down the path of trying to avoid the need for the copy in the
first place: What if we don't modify pages while it's undergoing IO?

The naive approach would be to not set hint bits with just a shared lock - but
that doesn't seem viable at all. For performance we rely on hint bits being
set and in many cases we'll only encounter the page in shared mode. We could
implement a conditional lock upgrade to an exclusive lock and do so while
setting hint bits, but that'd obviously be concerning from a concurrency point
of view.

What I suspect we might want instead is something inbetween a share and an
exclusive lock, which is taken while setting a hint bit and which conflicts
with having an IO in progress.

On first blush it might sound attractive to introduce this on the level of
lwlocks. However, I doubt that is a good idea - it'd make lwlock.c more
complicated which would imply overhead for other users, while the precise
semantics would be fairly specific to buffer locking. A variant of this would
be to generalize lwlock.c to allow implementing different kinds of locks more
easily. But that's a significant project on its own and doesn't really seem
necessary for this specific project.

What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.

Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.

But: We can address this and improve performance over the status quo! Today we
determine tuple visiblity determination one-by-one, even when checking the
visibility of an entire page worth of tuples. That's not exactly free. I've
prototyped checking visibility of an entire page of tuples at once and it
indeed speeds up visibility checks substantially (in some cases seqscans are
over 20% faster!).

Once we have page-level visibility checks we can get the right to set hint
bits once for an entire page instead of doing it for every tuple - with that
in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
wins.

Having a page level approach to setting hint bits has other advantages:

E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set
on the page and we don't mark a page dirty on hot standby. Which often will
result in hint bits notpersistently set on replicas until the page is frozen.

Another issue is that we'll often WAL log hint bits for a page (due to hint
bits being set), just to then immediately log another WAL record for the same
page (e.g. for pruning), which is obviously wasteful. With a different
interface we could combine the WAL records for both.

I've not prototyped either, but I'm fairly confident they'd be helpful.

Does this sound like a reasonable idea? Counterpoints? If it does sound
reasonable, I'll clean up my pile of hacks into something
semi-understandable...

Greetings,

Andres Freund

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marcos Pegoraro 2024-09-24 15:59:26 Re: Why mention to Oracle ?
Previous Message Fujii Masao 2024-09-24 15:46:24 Re: Add on_error and log_verbosity options to file_fdw