Re: Using per-transaction memory contexts for storing decoded tuples

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using per-transaction memory contexts for storing decoded tuples
Date: 2024-09-16 17:12:50
Message-ID: CAD21AoADP-MSfie8VB_CykQ8zQ=X3QnnTO=h+19mc_HaYFVZVA@mail.gmail.com

On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > We have several reports that logical decoding uses much more memory
> > than logical_decoding_work_mem [1][2][3]. For instance, in one of
> > the reports [1], even though the user set logical_decoding_work_mem
> > to '256MB', a walsender process was killed by OOM because it was
> > using more than 4GB of memory.
> >
> > As per the discussion in these threads so far, what happened was that
> > there was huge memory fragmentation in rb->tup_context.
> > rb->tup_context uses GenerationContext with 8MB memory blocks. We
> > cannot free memory blocks until all memory chunks in the block are
> > freed. If there is a long-running transaction making changes, its
> > changes could be spread across many memory blocks and we end up not
> > being able to free memory blocks unless the long-running transaction
> > is evicted or completed. Since we don't account for fragmentation,
> > block header size, or chunk header size in per-transaction memory
> > usage (i.e. txn->size), rb->size can stay below
> > logical_decoding_work_mem while the actual memory allocated in the
> > context is much higher than logical_decoding_work_mem.
> >
>
> It is not clear to me how the fragmentation happens. Is it because of
> interleaved transactions that have already ended but whose memory has
> not been released?

In a generation context, a memory block can be freed only once all
memory chunks in it have been freed. So the individual tuple buffers
have already been pfree()'d, but the underlying memory blocks are not
released.
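
To make the retention pattern concrete, here is a minimal sketch
(hypothetical backend code against the current memory-context APIs;
the allocation pattern and sizes are invented for illustration, and
generation_fragmentation_demo is not a real function):

    #include "postgres.h"

    #include "utils/memutils.h"

    static void
    generation_fragmentation_demo(void)
    {
        MemoryContext tup_context;
        char       *pinned[1000];

        /* Same parameters reorderbuffer uses for rb->tup_context: 8MB blocks */
        tup_context = GenerationContextCreate(CurrentMemoryContext,
                                              "demo tuples",
                                              SLAB_LARGE_BLOCK_SIZE,
                                              SLAB_LARGE_BLOCK_SIZE,
                                              SLAB_LARGE_BLOCK_SIZE);

        for (int i = 0; i < 1000; i++)
        {
            /* one small chunk from a "long-running" transaction... */
            pinned[i] = MemoryContextAlloc(tup_context, 64);

            /* ...interleaved with many chunks from short transactions */
            for (int j = 0; j < 100; j++)
                pfree(MemoryContextAlloc(tup_context, 8192));
        }

        /*
         * Almost all chunks are pfree()'d now, but a generation context
         * returns a block only when every chunk in it is freed, and each
         * 8MB block still holds at least one pinned 64-byte chunk. So
         * ~64KB of live data keeps hundreds of megabytes allocated.
         */
        elog(LOG, "live: %zu bytes, allocated: %zu bytes",
             (Size) (1000 * 64),
             MemoryContextMemAllocated(tup_context, true));

        MemoryContextDelete(tup_context);   /* the only way to free it all */
    }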

> Can we try reducing the size of the
> 8MB memory blocks? The comment atop the allocation says: "XXX the
> allocation sizes used below pre-date generation context's block
> growing code. These values should likely be benchmarked and set to
> more suitable values.", so do we need some tuning here?

Reducing the size of the 8MB memory blocks would be one solution, and
it could be better in that it would be back-patchable. It would
mitigate the problem but not resolve it. I agree it's worth trying a
smaller block size and running some benchmarks. If it makes the
problem sufficiently unlikely to happen, it would be a good solution.
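
For reference, a sketch of what such tuning could look like. The first
call reflects, to the best of my reading, the current code in
ReorderBufferAllocate(); the second is just one candidate setting to
benchmark, not a recommendation:

    /* Current code: fixed 8MB blocks for decoded tuples */
    buffer->tup_context = GenerationContextCreate(new_ctx,
                                                  "Tuples",
                                                  SLAB_LARGE_BLOCK_SIZE,
                                                  SLAB_LARGE_BLOCK_SIZE,
                                                  SLAB_LARGE_BLOCK_SIZE);

    /*
     * One possible tuning to benchmark: start with small blocks and let
     * the context grow them toward 8MB, so sparsely used blocks pin less
     * memory. The concrete values here are illustrative only.
     */
    buffer->tup_context = GenerationContextCreate(new_ctx,
                                                  "Tuples",
                                                  SLAB_DEFAULT_BLOCK_SIZE,
                                                  SLAB_DEFAULT_BLOCK_SIZE,
                                                  SLAB_LARGE_BLOCK_SIZE);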

>
> > We can reproduce this issue with the attached patch
> > rb_excessive_memory_reproducer.patch (not intended for commit),
> > which adds memory usage reporting and includes a test. After running
> > the tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a
> > memory usage report in the server logs like the following:
> >
> > LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
> > rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
> >
> > This means that logical decoding allocated 1GB of memory even
> > though logical_decoding_work_mem is 256MB.
> >
> > One idea to deal with this problem is to use
> > MemoryContextMemAllocated() as the reorderbuffer's memory usage
> > instead of txn->total_size. That is, we evict transactions until the
> > value returned by MemoryContextMemAllocated() drops below
> > logical_decoding_work_mem. However, this could end up evicting many
> > (possibly all) transactions, because a transaction whose decoded
> > tuple data is spread across memory context blocks might be evicted
> > only after all other, larger transactions have been evicted (note
> > that we evict transactions from largest to smallest). Therefore, if
> > we want to do that, we would need to change the eviction algorithm,
> > for example to LSN order instead of transaction-size order.
> > Furthermore, the reorderbuffer changes counted in txn->size (and
> > therefore rb->size too) are stored in different memory contexts
> > depending on their kind. For example, decoded tuples are stored in
> > rb->tup_context, ReorderBufferChange entries are stored in
> > rb->change_context, and snapshot data is stored in builder->context.
> > So we would need to sort out which data is stored in which memory
> > context and which memory contexts should be accounted for in the
> > reorderbuffer's memory usage, which could be a large amount of work.
> >
> > Another idea is to have a memory context for storing decoded tuples
> > per transaction. That way, we can ensure that all memory blocks of
> > the context are freed when the transaction is evicted or completed.
> > I think this idea is simpler and worth considering, so I attached
> > the PoC patch, use_tup_context_per_ctx_v1.patch. Since decoded
> > tuple data is created individually, when the corresponding WAL
> > record is decoded, but is freed collectively, e.g., at transaction
> > eviction, a bump memory context would fit this case best. One
> > exception is that we immediately free the decoded tuple data if the
> > transaction is already aborted. However, in that case I think it is
> > possible to skip the WAL decoding in the first place.
> >
> > One thing we need to consider is that the number of transaction
> > entries and memory contexts in the reorderbuffer could be very
> > high, since it has entries for both top-level transactions and
> > subtransactions. To deal with that, the patch stores a
> > subtransaction's decoded tuples in its top-level transaction's
> > tuple context.
> >
>
> Won't that impact the calculation for ReorderBufferLargestTXN(),
> which can select either a subtransaction or a top-level xact?

Yeah, I missed that we could end up evicting only a subtransaction
when choosing the largest transaction, even though its tuples now live
in its top-level transaction's context. We need to find a better
solution.
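
For anyone following along, the shape of the per-transaction tuple
context approach, in rough outline (the field and function names
below, tup_context, rb_alloc_tuple, and rb_free_txn_tuples, are
illustrative and need not match the PoC patch):

    /* Illustrative sketch only; not the actual PoC patch. */

    /* In struct ReorderBufferTXN: per-top-level-transaction context */
    MemoryContext   tup_context;    /* NULL until the first decoded tuple */

    /* Allocate space for a decoded tuple of a (sub)transaction */
    static void *
    rb_alloc_tuple(ReorderBuffer *rb, ReorderBufferTXN *txn, Size len)
    {
        /* Subtransaction tuples live in the top-level txn's context */
        ReorderBufferTXN *toptxn = rbtxn_get_toptxn(txn);

        if (toptxn->tup_context == NULL)
            toptxn->tup_context =
                BumpContextCreate(rb->context,  /* PG17+ API */
                                  "txn tuples",
                                  SLAB_DEFAULT_BLOCK_SIZE,
                                  SLAB_DEFAULT_BLOCK_SIZE,
                                  SLAB_LARGE_BLOCK_SIZE);

        /* A bump context has no pfree(); tuples are freed en masse */
        return MemoryContextAlloc(toptxn->tup_context, len);
    }

    /* At eviction or commit/abort cleanup of the top-level transaction */
    static void
    rb_free_txn_tuples(ReorderBufferTXN *toptxn)
    {
        if (toptxn->tup_context != NULL)
        {
            /* Releases every block at once, however fragmented */
            MemoryContextDelete(toptxn->tup_context);
            toptxn->tup_context = NULL;
        }
    }

With this shape, evicting a subtransaction alone frees nothing, which
is exactly the ReorderBufferLargestTXN() problem above.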

>
> > Since we always evict subtransactions and their top-level
> > transaction at the same time, I think it would not be a problem.
> > We would still have to check for performance degradation due to
> > using many memory contexts.
> >
>
> The generation context was introduced in commit a4ccc1ce, which
> claims that it significantly reduced logical decoding's memory usage
> and improved its performance. Won't changing it require us to
> validate all the cases that led us to use the generation context in
> the first place?

Agreed. Will investigate the thread and check the cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
