Re: Improve eviction algorithm in ReorderBuffer

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Shubham Khanna <khannashubham1197(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improve eviction algorithm in ReorderBuffer
Date: 2024-03-14 03:02:39
Message-ID: CAD21AoAmj_gQt3LadRO5YrUr109AHBnbMh4Qt4cx7Wv3UwYJTA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 11, 2024 at 3:04 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Here are some review comments for v8-0003
>
> ======
> 0. GENERAL -- why the state enum?
>
> This patch introduced a new ReorderBufferMemTrackState with 2 states
> (REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
> REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
>
> It's the same as having a boolean flag OFF/ON, so I didn't see any
> benefit of the enum instead of a simple boolean flag like
> 'track_txn_sizes'.
>
> NOTE: Below in this post (see #11) I would like to propose another
> idea, which can simplify much further, eliminating the need for the
> state boolean. If adopted that will impact lots of these other review
> comments.

Good point! We used to use three states in the earlier version patch
but now that we have only two we don't necessarily need to use an
enum. I've used your idea.

>
> ======
> Commit Message
>
> 1.
> Previously, when selecting the transaction to evict during logical
> decoding, we check all transactions to find the largest
> transaction. Which could lead to a significant replication lag
> especially in case where there are many subtransactions.
>
> ~
>
> /Which could/This could/
>
> /in case/in the case/

Fixed.

>
> ======
> .../replication/logical/reorderbuffer.c
>
> 2.
> * We use a max-heap with transaction size as the key to efficiently find
> * the largest transaction. The max-heap state is managed in two states:
> * REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and
> REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.
>
> /The max-heap state is managed in two states:/The max-heap is managed
> in two states:/

This part is removed.

>
> ~~~
>
> 3.
> +/*
> + * Threshold of the total number of top-level and sub transactions
> that controls
> + * whether we switch the memory track state. While using max-heap to select
> + * the largest transaction is effective when there are many transactions being
> + * decoded, in many systems there is generally no need to use it as long as all
> + * transactions being decoded are top-level transactions. Therefore, we use
> + * MaxConnections as the threshold* so we can prevent switch to the
> state unless
> + * we use subtransactions.
> + */
> +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections
>
> 3a.
> /memory track state./memory tracking state./
>
> /While using max-heap/Although using max-heap/
>
> "in many systems" (are these words adding anything?)
>
> /threshold*/threshold/
>
> /so we can prevent switch/so we can prevent switching/
>

Fixed.

> ~
>
> 3b.
> There's nothing really in this name to indicate the units of the
> threshold. Consider if there is some more informative name for this
> macro: e.g.
> MAXHEAP_TX_COUNT_THRESHOLD (?)

Fixed.

>
> ~~~
>
> 4.
> + /*
> + * Don't start with a lower number than
> + * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least
> + * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
> + */
> + buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
> + buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
> + ReorderBufferTXNSizeCompare,
> + true, NULL);
> +
>
> IIUC the comment intends to say:
>
> Allocate the initial heap size greater than THRESHOLD because the
> txn_heap will not be used until the threshold is exceeded.
>
> Also, maybe the comment should make a point of saying "Note: the
> binary heap is INDEXED for faster manipulations". or something
> similar.
>

Fixed.

> ~~~
>
> 5.
> static void
> ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
> ReorderBufferChange *change,
> + ReorderBufferTXN *txn,
> bool addition, Size sz)
> {
> - ReorderBufferTXN *txn;
> ReorderBufferTXN *toptxn;
>
> - Assert(change->txn);
> -
>
> There seems some trick now where the passed 'change' could be NULL,
> which was not possible before. e.g., when change is NULL then 'txn' is
> not NULL, and vice versa. Some explanation about this logic and the
> meaning of these parameters should be written in this function
> comment.

Added comments.

>
> ~
>
> 6.
> + txn = txn != NULL ? txn : change->txn;
>
> IMO it's more natural to code the ternary using the same order as the
> parameters:
>
> e.g. txn = change ? change->txn : txn;

I see your point. I changed it to:

if (txn == NULL)
txn = change->txn;

so we don't change txn if it's not NULL.

>
> ~~~
>
> 7.
> /*
> * Build the max-heap and switch the state. We will run a heap assembly step
> * at the end, which is more efficient.
> */
> static void
> ReorderBufferBuildMaxHeap(ReorderBuffer *rb)
>
> /We will run a heap assembly step at the end, which is more
> efficient./The heap assembly step is deferred until the end, for
> efficiency./

Fixed.

>
> ~~~
>
> 8. ReorderBufferLargestTXN
>
> + if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
> + {
> + HASH_SEQ_STATUS hash_seq;
> + ReorderBufferTXNByIdEnt *ent;
> +
> + hash_seq_init(&hash_seq, rb->by_txn);
> + while ((ent = hash_seq_search(&hash_seq)) != NULL)
> + {
> + ReorderBufferTXN *txn = ent->txn;
> +
> + /* if the current transaction is larger, remember it */
> + if ((!largest) || (txn->size > largest->size))
> + largest = txn;
> + }
> +
> + Assert(largest);
> + }
>
> That Assert(largest) seems redundant because there is anyway another
> Assert(largest) immediately after this code.

Removed.

>
> ~~~
>
> 9.
> + /* Get the largest transaction from the max-heap */
> + if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
> + {
> + Assert(binaryheap_size(rb->txn_heap) > 0);
> + largest = (ReorderBufferTXN *)
> + DatumGetPointer(binaryheap_first(rb->txn_heap));
> }
> Assert(binaryheap_size(rb->txn_heap) > 0); seemed like slightly less
> readable way of saying:
>
> Assert(!binaryheap_empty(rb->txn_heap));

Fixed.

>
> ~~~
>
> 10.
> +
> +/*
> + * Compare between sizes of two transactions. This is for a binary heap
> + * comparison function.
> + */
> +static int
> +ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
>
> 10a.
> /Compare between sizes of two transactions./Compare two transactions by size./

Fixed.

>
> ~~~
>
> 10b.
> IMO this comparator function belongs just before the
> ReorderBufferAllocate() function since that is the only place where it
> is used.

I think it's better to move close to new max-heap related functions.

>
> ======
> src/include/replication/reorderbuffer.h
>
> 11.
> +/* State of how to track the memory usage of each transaction being decoded */
> +typedef enum ReorderBufferMemTrackState
> +{
> + /*
> + * We don't update max-heap while updating the memory counter. The
> + * max-heap is built before use.
> + */
> + REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
> +
> + /*
> + * We also update the max-heap when updating the memory counter so the
> + * heap property is always preserved.
> + */
> + REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
> +} ReorderBufferMemTrackState;
> +
>
> In my GENERAL review comment #0, I suggested the removal of this
> entire enum. e.g. It could be replaced with a boolean field
> 'track_txn_sizes'
>
> TBH, I think there is a better way to handle this "state". IIUC
> - the txn_heap is always allocated up-front.
> - you only "build" it when > threshold and
> - when it drops < 0.9 x threshold you reset it.
>
> Therefore, AFAICT you do not need to maintain any “switch states” at
> all; you simply need to check binaryheap_empty(txn_heap), right?
> * If the heap is empty…. It means you are NOT tracking, so don’t use it
> * If the heap is NOT empty …. It means you ARE tracking, so use it.
>
> ~
>
> Using my idea to remove the state flag will have the side effect of
> simplifying many other parts of this patch. For example
>
> BEFORE
> +static void
> +ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
> +{
> + if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
> + return;
> +
> ...
> + if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
> + {
> + rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
> + binaryheap_reset(rb->txn_heap);
> + }
> +}
>
> AFTER
> +static void
> +ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
> +{
> + if (binaryheap_empty(rb->txn_heap))
> + return;
> +
> ...
> + if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
> + binaryheap_reset(rb->txn_heap);
> +}

Agreed. I removed the enum and changed the logic.

>
> ~~~
>
> 12. struct ReorderBuffer
>
> + /* Max-heap for sizes of all top-level and sub transactions */
> + ReorderBufferMemTrackState memtrack_state;
> + binaryheap *txn_heap;
> +
>
> 12a.
> Why is this being referred to in the commit message and code comments
> as "max-heap" when the field is not called by that same name? Won't it
> be better to give the field a better name -- e.g. "txn_maxheap" or
> similar?

Not sure it helps increase readability. Other codes where we use
binaryheap use neither max nor min in the field name.

>
> ~
>
> 12b.
> This comment should also say that the heap is ordered by tx size --
> (e.g. the comparator is ReorderBufferTXNSizeCompare)

It seems to me the comment "/* Max-heap for sizes of all top-level and
sub transactions */" already mentions that, no? I'm not sure we need
to refer to the actual function name here.

I've attached new version patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v9-0001-Make-binaryheap-enlargeable.patch application/octet-stream 3.2 KB
v9-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch application/octet-stream 16.7 KB
v9-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch application/octet-stream 16.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2024-03-14 03:06:59 Re: Improve eviction algorithm in ReorderBuffer
Previous Message Tom Lane 2024-03-14 02:28:14 Re: ERROR: error triggered for injection point gin-leave-leaf-split-incomplete