Re: AIO writes vs hint bits vs checksums

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: AIO writes vs hint bits vs checksums
Date: 2024-09-24 20:30:25
Message-ID: jo5p5nthb3hxwfj7qifiu2rxk5y3hexbwxs5d6x2jotsvj3bq5@jhtrriubncjb
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
> On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> > So far the AIO patchset has solved this by introducing a set of "bounce
> > buffers", which can be acquired and used as the source/target of IO when doing
> > it in-place into shared buffers isn't viable.
> >
> > I am worried about that solution however, as either acquisition of bounce
> > buffers becomes a performance issue (that's how I did it at first, it was hard
> > to avoid regressions) or we reserve bounce buffers for each backend, in which
> > case the memory overhead for instances with relatively small amount of
> > shared_buffers and/or many connections can be significant.
>
> > But: We can address this and improve performance over the status quo! Today we
> > determine tuple visiblity determination one-by-one, even when checking the
> > visibility of an entire page worth of tuples. That's not exactly free. I've
> > prototyped checking visibility of an entire page of tuples at once and it
> > indeed speeds up visibility checks substantially (in some cases seqscans are
> > over 20% faster!).
>
> Nice! It sounds like you refactored the relationship between
> heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
> bit setting upward or the iterate-over-tuples downward. Is that about right?

I've tried about five variations, so I don't have one answer to this yet :).

One problem is that having repeated loops doing PageGetItemId(),
PageGetItem(), ItemIdGetLength() isn't exactly free. To some degree it can be
hidden by allowing for better superscalar execution, but not entirely.

I've been somewhat confused by the compiler generated code around ItemId
handling for a while, it looks way more expensive than it should - it
regularly is a bottleneck due to the sheer number of instructions being
executed leading to being frontend bound. But never quite looked into it
deeply enough to figure out what's causing it / how to fix it.

> > Once we have page-level visibility checks we can get the right to set hint
> > bits once for an entire page instead of doing it for every tuple - with that
> > in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
> > wins.
>
> How did page-level+BM_SETTING_HINTS performance compare to performance of the
> page-level change w/o the BM_SETTING_HINTS change?

Just ran that. There probably is a performance difference, but it's small
(<0.5%) making it somewhat hard to be certain. It looks like the main reason
for that is ConditionVariableBroadcast() on the iocv shows up even though
nobody is waiting.

I've been fighting that with AIO as well, so maybe it's time to figure out the
memory ordering rules that'd allow to check that without a full spinlock
acquisition.

If we figure it out, we perhaps should use the chance to get rid of
BM_PIN_COUNT_WAITER...

> > Does this sound like a reasonable idea? Counterpoints?
>
> I guess the main part left to discuss is index scans or other scan types where
> we'd either not do page-level visibility or we'd do page-level visibility
> including tuples we wouldn't otherwise use. BM_SETTING_HINTS likely won't
> show up so readily in index scan profiles, but the cost is still there.

I could indeed not make it show up in some simple index lookup heavy
workloads. I need to try some more extreme cases though (e.g. fetching all
tuples in a table via an index or having very long HOT chains).

If it's not visible cost-wise compared to all the other costs of index scans -
does it matter? If it's not visible it's either because it proportionally is
very small or because it's completely hidden by superscalar execution.

Another thing I forgot to mention that probably fits into the "tradeoffs"
bucket: Because BM_SETTING_HINTS would be just a single bit, one backend
setting hint bits would block out another backend setting hint bits. In most
situations that'll be fine, or even faster than not doing so due to reducing
cache line ping-pong, but in cases of multiple backends doing index lookups to
different unhinted tuples on the same page it could be a bit worse.

But I suspect that's fine because it's hard to believe that you could have
enough index lookups to unhinted tuples for that to be a bottleneck -
something has to produce all those unhinted tuples after all, and that's
rather more expensive. And for single-tuple visibility checks the window in
which hint bits are set is very small.

> How should we think about comparing the distributed cost of the buffer
> header manipulations during index scans vs. the costs of bounce buffers?

Well, the cost of bounce buffers would be born as long as postgres is up,
whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
really show up. If there are cases where the cost of the more expensive hint
bit logic does show up, it'll get a lot harder to weigh.

So far my prototype uses the path that avoids hint bits being set while IO is
going on all the time, not just when checksums are enabled. We could change
that, I guess. However, our habit of modifying buffers while IO is going on is
causing issues with filesystem level checksums as well, as evidenced by the
fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
tend to think it'd be better to just stop doing that alltogether (we also do
that for WAL, when writing out a partial page, but a potential fix there would
be different, I think).

A thing I forgot to mention: Bounce buffers are kind of an architectural dead
end, in that we wouln't need them in that form if we get to a threaded
postgres.

Zooming out (a lot) more: I like the idea of having a way to get the
permission to perform some kinds of modifications on a page without an
exlusive lock. While obviously a lot more work, I do think there's some
potential to have some fast-paths that perform work on a page level without
blocking out readers. E.g. some simple cases of insert could correctly be done
without blocking out readers (by ordering the update of the max page offset

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2024-09-24 20:30:55 Re: [PATCH] Add native windows on arm64 support
Previous Message Nathan Bossart 2024-09-24 20:26:47 Re: pg_ctl/miscinit: print "MyStartTime" as a long long instead of long to avoid 2038 problem.