Quick Links

Re: WIP: WAL prefetch (another approach)

From:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To:	Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc:	Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: WAL prefetch (another approach)
Date:	2022-03-11 05:31:13
Message-ID:	CA+hUKGKusWunqOd9En+FkT+US_kuDK-MSZyXHCr=OTwRw8Ah1A@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> I for now went through 0001, TL;DR the patch looks good to me. I have a few
> minor comments though, mostly to make things a bit clearer (at least to me).

Hi Julien,

Thanks for your review of 0001! It gave me a few things to think
about and some good improvements.

> diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
> index 2340dc247b..c129df44ac 100644
> --- a/src/bin/pg_waldump/pg_waldump.c
> +++ b/src/bin/pg_waldump/pg_waldump.c
> @@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
> * add an accessor macro for this.
> */
> *fpi_len = 0;
> + for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> {
> if (XLogRecHasBlockImage(record, block_id))
> - *fpi_len += record->blocks[block_id].bimg_len;
> + *fpi_len += record->record->blocks[block_id].bimg_len;
> }
> (and similar in that file, xlogutils.c and xlogreader.c)
>
> This could use XLogRecGetBlock? Note that this macro is for now never used.

Yeah, I think that is a good idea for pg_waldump.c and xlogutils.c. Done.

> xlogreader.c also has some similar forgotten code that could use
> XLogRecMaxBlockId.

That is true, but I was thinking of it like this: most of the existing
code that interacts with xlogreader.c is working with the old model,
where the XLogReader object holds only one "current" record. For that
reason the XLogRecXXX() macros continue to work as before, implicitly
referring to the record that XLogReadRecord() most recently returned.
For xlogreader.c code, I prefer not to use the XLogRecXXX() macros,
even when referring to the "current" record, since xlogreader.c has
switched to a new multi-record model. In other words, they're sort of
'old API' accessors provided for continuity. Does this make sense?

> + * See if we can release the last record that was returned by
> + * XLogNextRecord(), to free up space.
> + */
> +void
> +XLogReleasePreviousRecord(XLogReaderState *state)
>
> The comment seems a bit misleading, as I first understood it as it could be
> optional even if the record exists. Maybe something more like "Release the
> last record if any"?

Done.

> + * Remove it from the decoded record queue. It must be the oldest item
> + * decoded, decode_queue_tail.
> + */
> + record = state->record;
> + Assert(record == state->decode_queue_tail);
> + state->record = NULL;
> + state->decode_queue_tail = record->next;
>
> The naming is a bit counter intuitive to me, as before reading the rest of the
> code I wasn't expecting the item at the tail of the queue to have a next
> element. Maybe just inverting tail and head would make it clearer?

Yeah, after mulling this over for a day, I agree. I've flipped it around.

Explanation: You're quite right, singly-linked lists traditionally
have a 'tail' that points to null, so it makes sense for new items to
be added there and older items to be consumed from the 'head' end, as
you expected. But... it's also typical (I think?) in ring buffers AKA
circular buffers to insert at the 'head', and remove from the 'tail'.
This code has both a linked-list (the chain of decoded records with a
->next pointer), and the underlying storage, which is a circular
buffer of bytes. I didn't want them to use opposite terminology, and
since I started by writing the ring buffer part, that's where I
finished up... I agree that it's an improvement to flip them.

> +DecodedXLogRecord *
> +XLogNextRecord(XLogReaderState *state, char **errormsg)
> +{
> [...]
> + /*
> + * state->EndRecPtr is expected to have been set by the last call to
> + * XLogBeginRead() or XLogNextRecord(), and is the location of the
> + * error.
> + */
> +
> + return NULL;
>
> The comment should refer to XLogFindNextRecord, not XLogNextRecord?

No, it does mean to refer to the XLogNextRecord() (ie the last time
you called XLogNextRecord and successfully dequeued a record, we put
its end LSN there, so if there is a deferred error, that's the
corresponding LSN). Make sense?

> Also, is it worth an assert (likely at the top of the function) for that?

How could I assert that EndRecPtr has the right value?

> XLogRecord *
> XLogReadRecord(XLogReaderState *state, char **errormsg)
> +{
> [...]
> + if (decoded)
> + {
> + /*
> + * XLogReadRecord() returns a pointer to the record's header, not the
> + * actual decoded record. The caller will access the decoded record
> + * through the XLogRecGetXXX() macros, which reach the decoded
> + * recorded as xlogreader->record.
> + */
> + Assert(state->record == decoded);
> + return &decoded->header;
>
> I find it a bit weird to mention XLogReadRecord() as it's the current function.

Changed to "This function ...".

> +/*
> + * Allocate space for a decoded record. The only member of the returned
> + * object that is initialized is the 'oversized' flag, indicating that the
> + * decoded record wouldn't fit in the decode buffer and must eventually be
> + * freed explicitly.
> + *
> + * Return NULL if there is no space in the decode buffer and allow_oversized
> + * is false, or if memory allocation fails for an oversized buffer.
> + */
> +static DecodedXLogRecord *
> +XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
>
> Is it worth clearly stating that it's the reponsability of the caller to update
> the decode_buffer_head (with the real size) after a successful decoding of this
> buffer?

Comment added.

> + if (unlikely(state->decode_buffer == NULL))
> + {
> + if (state->decode_buffer_size == 0)
> + state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
> + state->decode_buffer = palloc(state->decode_buffer_size);
> + state->decode_buffer_head = state->decode_buffer;
> + state->decode_buffer_tail = state->decode_buffer;
> + state->free_decode_buffer = true;
> + }
>
> Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it
> here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as
> the only caller is the recovery prefetching.

I don't think it matters much?

> + return decoded;
> +}
>
> I would find it a bit clearer to explicitly return NULL here.

Done.

> readOff = ReadPageInternal(state, targetPagePtr,
> Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
> - if (readOff < 0)
> + if (readOff == XLREAD_WOULDBLOCK)
> + return XLREAD_WOULDBLOCK;
> + else if (readOff < 0)
>
> ReadPageInternal comment should be updated to mention the new XLREAD_WOULDBLOCK
> possible return value.

Yeah. Done.

> It's also not particulary obvious why XLogFindNextRecord() doesn't check for
> this value. AFAICS callers don't (and should never) call it with a
> nonblocking == true state, maybe add an assert for that?

Fair point. I have now explicitly cleared that flag. (I don't much
like state->nonblocking, which might be better as an argument to
page_read(), but in fact I don't like the fact that page_read
callbacks are blocking in the first place, which is why I liked
Horiguchi-san's patch to get rid of that... but that can be a subject
for later work.)

> @@ -468,7 +748,7 @@ restart:
> if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
> {
> state->overwrittenRecPtr = RecPtr;
> - ResetDecoder(state);
> + //ResetDecoder(state);
>
> AFAICS this is indeed not necessary anymore, so it can be removed?

Oops, yeah I use C++ comments when there's something I intended to
remove. Done.

> static void
> ResetDecoder(XLogReaderState *state)
> {
> [...]
> + /* Reset the decoded record queue, freeing any oversized records. */
> + while ((r = state->decode_queue_tail))
>
> nit: I think it's better to explicitly check for the assignment being != NULL,
> and existing code is more frequently written this way AFAICS.

I think it's perfectly normal idiomatic C, but if you think it's
clearer that way, OK, done like that.

> +/* Return values from XLogPageReadCB. */
> +typedef enum XLogPageReadResultResult
>
> typo

Fixed.

I realised that this version has broken -DWAL_DEBUG. I'll fix that
shortly, but I wanted to post this update ASAP, so here's a new
version. The other thing I need to change is that I should turn on
recovery_prefetch for platforms that support it (ie Linux and maybe
NetBSD only for now), in the tests. Right now you need to put
recovery_prefetch=on in a file and then run the tests with
"TEMP_CONFIG=path_to_that make -C src/test/recovery check" to
excercise much of 0002.

Attachment	Content-Type	Size
v22-0001-Add-circular-WAL-decoding-buffer-take-II.patch	text/x-patch	50.9 KB
v22-0002-Prefetch-referenced-data-in-recovery-take-II.patch	text/x-patch	70.3 KB

In response to

Re: WIP: WAL prefetch (another approach) at 2022-03-09 06:46:33 from Julien Rouhaud

Responses

Re: WIP: WAL prefetch (another approach) at 2022-03-11 05:35:26 from Thomas Munro
Re: WIP: WAL prefetch (another approach) at 2022-03-11 06:03:18 from Andres Freund
Re: WIP: WAL prefetch (another approach) at 2022-03-11 08:27:37 from Julien Rouhaud

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Thomas Munro	2022-03-11 05:35:26	Re: WIP: WAL prefetch (another approach)
Previous Message	Ashutosh Sharma	2022-03-11 05:15:25	Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints