Re: Using per-transaction memory contexts for storing decoded tuples

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using per-transaction memory contexts for storing decoded tuples
Date: 2024-09-13 10:58:29
Message-ID: CAA4eK1+eNFNa=TOvWY83rAdoWZ0r_8+Q8GEUD0YF4oP-v9nApg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> We have several reports that logical decoding uses much more memory
> than logical_decoding_work_mem [1][2][3]. For instance, in one of the
> reports[1], even though the user set logical_decoding_work_mem to
> '256MB', a walsender process was killed by the OOM killer because it
> was using more than 4GB of memory.
>
> As per the discussion in these threads so far, what happened was that
> there was huge memory fragmentation in rb->tup_context.
> rb->tup_context uses a GenerationContext with 8MB memory blocks. We
> cannot free a memory block until all memory chunks in the block are
> freed. If there is a long-running transaction making changes, its
> changes could be spread across many memory blocks, and we end up
> being unable to free those blocks until the long-running transaction
> is evicted or completed. Since we don't account for fragmentation,
> block header size, or chunk header size in per-transaction memory
> usage (i.e. txn->size), rb->size can stay below
> logical_decoding_work_mem while the actual memory allocated in the
> context is much higher than logical_decoding_work_mem.
>

It is not clear to me how the fragmentation happens. Is it because of
some interleaving transactions which have already ended but whose
memory has not been released? Can we try reducing the size of the 8MB
memory blocks? The comment atop the allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?
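
If it is the interleaving case, is the pattern something like the
standalone sketch below (hypothetical, not taken from your
reproducer), where a few small live chunks from a long-running
transaction pin every 8MB block even after the short transactions'
chunks have been freed?

#include "postgres.h"
#include "utils/memutils.h"

static void
illustrate_tup_context_fragmentation(void)
{
	MemoryContext tup_context;
	void	   *longlived[1024];
	int			i;

	/* same parameters as rb->tup_context: 8MB generation blocks */
	tup_context = GenerationContextCreate(CurrentMemoryContext,
										  "tuples-demo",
										  SLAB_LARGE_BLOCK_SIZE,
										  SLAB_LARGE_BLOCK_SIZE,
										  SLAB_LARGE_BLOCK_SIZE);

	for (i = 0; i < 1024; i++)
	{
		void	   *shortlived;

		/* a small chunk belonging to the long-running transaction ... */
		longlived[i] = MemoryContextAlloc(tup_context, 64);

		/* ... interleaved with a bigger chunk from a short transaction */
		shortlived = MemoryContextAlloc(tup_context, 64 * 1024);
		pfree(shortlived);		/* freed, but its block stays pinned */
	}

	/*
	 * Only ~64kB is still live, yet every block contains at least one
	 * live chunk, so tens of megabytes of blocks remain allocated.
	 */
	elog(LOG, "allocated: %zu bytes",
		 MemoryContextMemAllocated(tup_context, true));

	/* once the "long-running transaction" finishes, the blocks go away */
	MemoryContextDelete(tup_context);
}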

> We can reproduce this issue with the attached patch
> rb_excessive_memory_reproducer.patch (not intended for commit) that
> adds memory usage reporting and includes a test. After running the
> TAP test contrib/test_decoding/t/002_rb_memory.pl, we can see a
> memory usage report in the server logs like the following:
>
> LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
> rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
>
> This means that logical decoding allocated about 1GB of memory
> (mostly in rb->tup_context) even though logical_decoding_work_mem is
> 256MB and the accounted rb->size is only about 17MB.
>
> One idea to deal with this problem is to use
> MemoryContextMemAllocated() as the reorderbuffer's memory usage
> instead of txn->total_size. That is, we evict transactions until the
> value returned by MemoryContextMemAllocated() gets lower than
> logical_decoding_work_mem. However, this could end up evicting a lot
> of (possibly all) transactions, because a transaction whose decoded
> tuple data is spread across memory context blocks could be evicted
> only after all other, larger transactions have been evicted (note
> that we evict transactions from largest to smallest). Therefore, if
> we want to do that, we would need to change the eviction algorithm,
> for example to LSN order instead of transaction-size order.
> Furthermore, the reorderbuffer's changes that are counted in
> txn->size (and therefore rb->size too) are stored in different memory
> contexts depending on their kind. For example, decoded tuples are
> stored in rb->tup_context, ReorderBufferChange entries are stored in
> rb->change_context, and snapshot data is stored in builder->context.
> So we would need to sort out which data is stored in which memory
> context and which memory contexts should be accounted for in the
> reorderbuffer's memory usage, which could be a large amount of work.
>
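
IIUC, the first idea amounts to something like the rough sketch below
(untested, not from your patch; the names follow the existing
ReorderBufferCheckMemoryLimit(), ReorderBufferLargestTXN() and
ReorderBufferSerializeTXN() routines in reorderbuffer.c):

/*
 * Drive eviction off the allocator-level number instead of rb->size.
 */
static void
ReorderBufferCheckAllocatedLimit(ReorderBuffer *rb)
{
	/* compare actually-allocated memory, including block/chunk overhead */
	while (MemoryContextMemAllocated(rb->context, true) >=
		   logical_decoding_work_mem * 1024L)
	{
		ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

		/* nothing left to evict even though blocks are still allocated */
		if (txn == NULL || txn->size == 0)
			break;

		/*
		 * Spilling the largest transaction (by txn->size) may barely
		 * reduce MemoryContextMemAllocated() if its chunks share blocks
		 * with other transactions, which is the eviction-order problem
		 * you describe.
		 */
		ReorderBufferSerializeTXN(rb, txn);
	}
}

That does seem to require changing the eviction order, which matches
your point.
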
> Another idea is to have a per-transaction memory context for storing
> decoded tuples. That way, we can ensure that all memory blocks of the
> context are freed when the transaction is evicted or completed. I
> think this idea is simpler and worth considering, so I attached a PoC
> patch, use_tup_context_per_ctx_v1.patch. Since decoded tuple data is
> created individually when the corresponding WAL record is decoded but
> is released collectively (e.g., at transaction eviction), the bump
> memory context would fit this case best. One exception is that we
> immediately free the decoded tuple data if the transaction has
> already aborted; however, in that case, I think it is possible to
> skip the WAL decoding in the first place.
>
> One thing we need to consider is that the number of transaction
> entries and memory contexts in the reorderbuffer could be very high,
> since it has entries for both top-level transactions and
> sub-transactions. To deal with that, the patch changes things so that
> the decoded tuples of a subtransaction are stored in its top-level
> transaction's tuple context.
>

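IIUC, with the patch the tuple allocation path ends up looking roughly
like the sketch below (my reading of your description, not the patch
itself; it assumes a hypothetical tup_context field in
ReorderBufferTXN and an extra txn argument to
ReorderBufferGetTupleBuf()):

/* in reorderbuffer.c; sketch only, the actual patch may differ */
HeapTuple
ReorderBufferGetTupleBuf(ReorderBuffer *rb, ReorderBufferTXN *txn,
						 Size tuple_len)
{
	/* subtransactions share their top-level transaction's tuple context */
	ReorderBufferTXN *toptxn = rbtxn_get_toptxn(txn);
	HeapTuple	tuple;
	Size		alloc_len = tuple_len + SizeofHeapTupleHeader;

	/* create the per-transaction bump context lazily; sizes are a guess */
	if (toptxn->tup_context == NULL)
		toptxn->tup_context = BumpContextCreate(rb->context,
												"txn tuples",
												SLAB_DEFAULT_BLOCK_SIZE,
												SLAB_DEFAULT_BLOCK_SIZE,
												SLAB_LARGE_BLOCK_SIZE);

	tuple = (HeapTuple) MemoryContextAlloc(toptxn->tup_context,
										   HEAPTUPLESIZE + alloc_len);
	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);

	return tuple;
}

If that is the shape of it, then a single MemoryContextDelete() or
MemoryContextReset() of the top-level transaction's tuple context at
eviction or commit time returns all of its blocks at once, regardless
of what other transactions were doing in the meantime.
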
Won't that impact the calculation in ReorderBufferLargestTXN(), which
can select either a subtransaction or a top-level xact?

> Since we always evict sub-transactions and their top-level
> transaction at the same time, I think it would not be a problem. We
> would still need to check for performance degradation due to using
> many memory contexts.
>

The generation context was introduced in commit a4ccc1ce, which claims
that it significantly reduced logical decoding's memory usage and
improved its performance. Won't changing it require us to validate all
the cases that led us to use the generation context in the first
place?

--
With Regards,
Amit Kapila.
