Quick Links

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From:	Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject:	Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date:	2023-10-26 22:16:32
Message-ID:	CALj2ACVjzHgQvONnTLffyBJRACPwhsf-cYG4TQ2KHKrFfFop-w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Oct 25, 2023 at 5:45 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> Comments:

Thanks for reviewing.

> * It would be good to document that this is partially an optimization
> (read from memory first) and partially an API difference that allows
> reading unflushed data. For instance, walsender may benefit
> performance-wise (and perhaps later with the ability to read unflushed
> data) whereas pg_walinspect benefits primarily from reading unflushed
> data.

Commit message has these things covered in detail. However, I think
adding some info in the code comments is a good idea and done around
the WALRead() function in the attached v13 patch set.

> * Shouldn't there be a new method in XLogReaderRoutine (e.g.
> read_unflushed_data), rather than having logic in WALRead()? The
> callers can define the method if it makes sense (and that would be a
> good place to document why); or leave it NULL if not.

I've designed the new function XLogReadFromBuffers to read from WAL
buffers in such a way that one can easily embed it in page_read
callbacks if it makes sense. Almost all the available backend
page_read callbacks read_local_xlog_page_no_wait,
read_local_xlog_page, logical_read_xlog_page except XLogPageRead
(which is used for recovery when WAL buffers aren't used at all) have
one thing in common, that is, WALRead(). Therefore, it seemed a
natural choice for me to call XLogReadFromBuffers. In other words, I'd
say it's the responsibility of page_read callback implementers to
decide if they want to read from WAL buffers or not and hence I don't
think we need a separate XLogReaderRoutine.

If someone wants to read unflushed WAL, the typical way to implement
it is to write a new page_read callback
read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
similar without WALRead() but just the new function
XLogReadFromBuffers to read from WAL buffers and return.

> * I'm not clear on the "partial hit" case. Wouldn't that mean you found
> the earliest byte in the buffers, but not the latest byte requested? Is
> that possible, and if so under what circumstances? I added an
> "Assert(nread == 0 || nread == count)" in WALRead() after calling
> XLogReadFromBuffers(), and it wasn't hit.
>
> * If the partial hit case is important, wouldn't XLogReadFromBuffers()
> fill in the end of the buffer rather than the beginning?

Partial hit was possible when the requested WAL pages are read one
page at a time from WAL buffers with WALBufMappingLock
acquisition-release for each page as the requested page can be
replaced by the time the lock is released and reacquired. This was the
case up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com.
Now that the approach has been changed to read multiple pages at once
under one WALBufMappingLock acquisition-release. .
We can either keep the partial hit handling (just to not miss
anything) or turn the following partial hit case to an error or an
Assert(false);. I prefer to keep the partial hit handling as-is just
in case:
+ else if (count > nread)
+ {
+ /*
+ * Buffer partial hit, so reset the state to count the read bytes
+ * and continue.
+ */
+ buf += nread;
+ startptr += nread;
+ count -= nread;
+ }

> * Other functions that use xlblocks, e.g. GetXLogBuffer(), use more
> effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it,
> too? One idea is to check that XLogCtl->xlblocks[idx] is equal to
> expectedEndPtr both before and after the memcpy(), with appropriate
> barriers. That could mitigate concerns expressed by Kyotaro Horiguchi
> and Masahiko Sawada.

Yes, I proposed that idea in another thread -
https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com.
If that looks okay, I can add it to the next version of this patch
set.

> * Are you sure that reducing the number of calls to memcpy() is a win?
> I would expect that to be true only if the memcpy()s are tiny, but here
> they are around XLOG_BLCKSZ. I believe this was done based on a comment
> from Nathan Bossart, but I didn't really follow why that's important.
> Also, if we try to use one memcpy for all of the data, it might not
> interact well with my idea above to avoid taking the lock.

Up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com,
the requested WAL was being read one page at a time from WAL buffers
into output buffer with one memcpy call for each page. Now that the
approach has been changed to read multiple pages at once under one
WALBufMappingLock acquisition-release with comparatively lesser number
of memcpy calls. I honestly haven't seen any difference between the
two approaches -
https://www.postgresql.org/message-id/CALj2ACUpQGiwQTzmoSMOFk5%3DWiJc06FcYpxzBX0SEej4ProRzg%40mail.gmail.com.

The new approach of reading multiple pages at once under one
WALBufMappingLock acquisition-release clearly wins over reading one
page at a time with multiple lock acquisition-release cycles.

> * Style-wise, the use of "unlikely" seems excessive, unless there's a
> reason to think it matters.

Given the current use of XLogReadFromBuffers, the input parameters are
passed as expected, IOW, these are unlikely events. The comments [1]
say that the unlikely() is to be used in hot code paths; I think
reading WAL from buffers is a hot code path especially when called
from (logical, physical) walsenders. If there's any stronger reason
than the appearance/style-wise, I'm okay to not use them. For now,
I've retained them.

FWIW, I found heapam.c using unlikely() extensively for safety checks.

[1]
* Hints to the compiler about the likelihood of a branch. Both likely() and
* unlikely() return the boolean value of the contained expression.
*
* These should only be used sparingly, in very hot code paths. It's very easy
* to mis-estimate likelihoods.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v13-0001-Allow-WAL-reading-from-WAL-buffers.patch	application/octet-stream	11.7 KB
v13-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch	application/octet-stream	9.9 KB

In response to

Re: Improve WALRead() to suck data directly from WAL buffers when possible at 2023-10-25 00:15:19 from Jeff Davis

Responses

Re: Improve WALRead() to suck data directly from WAL buffers when possible at 2023-10-27 20:52:56 from Jeff Davis

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bruce Momjian	2023-10-26 22:36:25	Re: Document aggregate functions better w.r.t. ORDER BY
Previous Message	David G. Johnston	2023-10-26 22:09:26	Re: Document aggregate functions better w.r.t. ORDER BY