Re: AIO writes vs hint bits vs checksums

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: AIO writes vs hint bits vs checksums
Date: 2024-10-30 16:45:27
Message-ID: r3ojcaelbruczzxegup2fqwkdk35p4rjju7kycbskesvh33oep@fojjo4teqx56
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2024-10-30 13:29:01 +0200, Heikki Linnakangas wrote:
> On 30/10/2024 04:21, Andres Freund wrote:
> > Attached is a, unfortunately long, series of patches implementing what I
> > described upthread.
>
> Review of the preparatory patches:
>
> > 0001 Add very basic test for kill_prior_tuples
> >
> > We currently don't exercise this patch for gist and hash, which seems
> > somewhat criminal.
>
> Interesting to use the isolationtester for this. There's just one session,
> so you're just using it to define reusable steps with handy names.

Yea. I had started out writing it as a pg_regress style test and it quickly got very
hard to understand.

> I'm fine with that, but please add a comment to explain it.

Makes sense.

> I wonder if it'd be more straightforward to make it a regular pg_regress
> test though. There would be some repetition, but would it be so bad?

I found it to be quite bad. If testing just one AM it's ok-ish, but once you
test 2-3 it gets very long and repetitive. I guess we could use functions or
such to make it a bit less painful - but that point, is it actually simpler?

> You forgot to add the new test to 'isolation_schedule'.

Oops.

> typos:
> "inex" -> "index"
> "does something approximately reasonble" -> "do something approximately
> reasonable"

Oops^2.

> > +/*
> > + * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
> > + * adjust hint bits. See the comment for SetHintBits() for more background.
> > + *
> > + * This helper returns false if the row ought to be invisible, true otherwise.
> > + */
> > +static inline bool
> > +HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
> > +{
> > + TransactionId xvac;
> > +
> > + /* only used by pre-9.0 binary upgrades */
> > + if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
> > + return true;
> > +
> > + ...
> > +}
>
> This is so unlikely these days that this function probably should not be
> inlined anymore.

Because I put the check for moved in the function it is important that it's
inlined. The compiler can understand that the rest of the function is unlikely
to be executed and put it together with other less likely to be executed code.

> > 0003 bufmgr: Add BufferLockHeldByMe()
> >
> > Useful for writing assertions in later patches.
>
> > + if (mode == BUFFER_LOCK_EXCLUSIVE)
> > + lwmode = LW_EXCLUSIVE;
> > + else if (mode == BUFFER_LOCK_SHARE)
> > + lwmode = LW_SHARED;
> > + else
> > + {
> > + Assert(false);
> > + pg_unreachable();
> > + lwmode = LW_EXCLUSIVE; /* assuage compiler */
> > + }
>
> I just learned a new word :-).

Heh.

> This is fine, but to avoid the assuaging, you could also do this:
>
> if (mode == BUFFER_LOCK_EXCLUSIVE)
> lwmode = LW_EXCLUSIVE;
> else
> {
> Assert(mode == BUFFER_LOCK_SHARE);
> lwmode = LW_SHARED;
> }

I have no opinion about which version is better...

> > 0004 heapam: Use exclusive lock on old page in CLUSTER
> >
> > When I started looking into this I wanted to make sure that it's actually
> > safe to skip setting hint bits in all places. One place that's *not* safe
> > is cluster - we assume that all hint bits are set in rewrite_heap_tuple().
> >
> > This might not strictly be required, but it seemed very fragile as
> > is. While using an exclusive lock does seem like it makes it a bit safer,
> > it's not a very good fix, as we still dirty the old buffer, which seems
> > like a shame when one uses VACUUM FULL.
>
> > + /*
> > + * To be able to guarantee that we can set the hint bit, acquire an
> > + * exclusive lock on the old buffer. We need the hint bits to be set
> > + * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
> > + * heap_freeze_tuple() will get confused.
> > + *
>
> Hmm, I don't see any comments in the reform_and_rewrite_tuple() ->
> rewrite_heap_tuple() -> heap_freeze_tuple() path about requiring hint bits
> to be set.

There's this comment is in heap_prepare_freeze_tuple():

* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).

I don't remember the precise details of what all goes wrong, I think there was
more than one avenue.

One example is stuff like this:
/*
* If the tuple has been updated, check the old-to-new mapping hash table.
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
!HeapTupleHeaderIndicatesMovedPartitions(old_tuple->t_data) &&
!(ItemPointerEquals(&(old_tuple->t_self),
&(old_tuple->t_data->t_ctid))))

if the original tuple had an xmax but we didn't unset it in
HeapTupleSatisfiesVacuum(), due to not being allowed to write a hint bit,
we'll behave differently here. But at the same time
heapam_relation_copy_for_cluster() will treat the tuple as dead and call
rewrite_heap_dead_tuple().

> How does it get confused?

Corrupted visibility information after a CLUSTER.

> > 0006 heapam: Add batch mode mvcc check and use it in page mode
> >
> > This is a good bit faster on its own, but is required to avoid a
> > performance regression later, when setting hint bits only only when no IO
> > going on at the same.
>
> The BATCHMVCC_FEWER_ARGS stuff is pretty weird. Hard to imagine that it'd
> make any difference how you pass those arguments.

I unfortunately found it to make a difference :(, even after trying to account
for binary layout difference by randomizing section order via linker flags.

> > + if (likely(valid))
> > + {
> > + vistuples_dense[nvis] = tup->t_self.ip_posid;
> > + nvis++;
> > + }
>
> Is the likely(valid) hint important for performance here? It would be a bad
> hint for a table with lots of dead tuples. I feel we'd better rely on the
> branch predictor for this one.

It did :(

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ants Aasma 2024-10-30 17:17:47 Re: protocol-level wait-for-LSN
Previous Message Nathan Bossart 2024-10-30 16:30:04 Re: Reduce one comparison in binaryheap's sift down