Re: AIO writes vs hint bits vs checksums

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: AIO writes vs hint bits vs checksums
Date: 2024-10-30 12:16:51
Message-ID: bc2183ec-3728-4584-8999-888081760138@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 24/09/2024 18:55, Andres Freund wrote:
> What I'd instead like to propose is to implement the right to set hint bits as
> a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> BM_SETTING_HINTS to be unset to start IO.
>
> Naively implementing this, by acquiring and releasing the permission to set
> hint bits in SetHintBits() unfortunately leads to a significant performance
> regression. While the performance is unaffected for OLTPish workloads like
> pgbench (both read and write), sequential scans of unhinted tables regress
> significantly, due to the per-tuple lock acquisition this would imply.

It'd be nice to implement this without having to set and clear
BM_SETTING_HINTS every time. Could we put the overhead on the
FlushBuffer() instead?

How about something like this:

To set hint bits:

1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page

To flush a buffer:

1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and
immediately downgrade back to share mode.
5. calculate CRC, flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.

This goes back to the idea of adding LWLock support for this, but the
amount of changes could be pretty small. The missing operation we don't
have today is reading the share-lock count on the lock in step 3. That
seems simple to add.

Acquiring the exclusive lock in step 4 is just a way to wait for all the
existing share-lockers to release the lock. You wouldn't need to block
new share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we
need, but it currently ignores share-lockers. So doing this "properly"
would require more changes to LWLocks. Briefly acquiring an exclusive
lock seems acceptable though.

> But: We can address this and improve performance over the status quo! Today we
> determine tuple visiblity determination one-by-one, even when checking the
> visibility of an entire page worth of tuples.

Yes, batching the visibility checks makes sense in any case.

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-10-30 12:31:36 Re: doc issues in event-trigger-matrix.html
Previous Message Robert Haas 2024-10-30 12:09:17 Re: Reduce one comparison in binaryheap's sift down