Re: Using per-transaction memory contexts for storing decoded tuples

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using per-transaction memory contexts for storing decoded tuples
Date: 2024-09-13 05:26:27
Message-ID: 162ff6202f3c12b5769f59c3eb2bc0ca@oss.nttdata.com
Lists: pgsql-hackers

On 2024-09-12 07:32, Masahiko Sawada wrote:

Thanks a lot for working on this!

> Hi all,
>
> We have several reports that logical decoding uses much more memory
> than logical_decoding_work_mem [1][2][3]. For instance, in one of the
> reports [1], even though the user had set logical_decoding_work_mem to
> '256MB', a walsender process was killed by the OOM killer because it
> was using more than 4GB of memory.
>
> As per the discussion in these threads so far, what happened was that
> there was huge memory fragmentation in rb->tup_context.
> rb->tup_context uses a GenerationContext with 8MB memory blocks. We
> cannot free a memory block until all memory chunks in the block are
> freed. If there is a long-running transaction making changes, its
> changes could be spread across many memory blocks, and we end up not
> being able to free those blocks until the long-running transaction is
> evicted or completes. Since we don't account for fragmentation, block
> header size, or chunk header size in the per-transaction memory usage
> (i.e., txn->size), rb->size can stay below logical_decoding_work_mem
> while the memory actually allocated in the context is much higher than
> logical_decoding_work_mem.
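
To check my understanding of this accounting gap, here is a tiny
standalone C model; it is not PostgreSQL code, and the block size,
chunk size, and counts are made up for illustration. A long-running
transaction keeps one small chunk alive in every 8MB block, so the
accounted size stays around 1MB while the blocks themselves still pin
about 1GB:

#include <stdio.h>

#define BLOCK_SIZE       (8 * 1024 * 1024)   /* 8MB, like rb->tup_context blocks */
#define CHUNK_SIZE       (8 * 1024)          /* decoded tuple size, arbitrary    */
#define CHUNKS_PER_BLOCK (BLOCK_SIZE / CHUNK_SIZE)
#define NBLOCKS          128                 /* 128 * 8MB = 1GB of blocks        */

int
main(void)
{
    long    live_chunks[NBLOCKS];
    long    accounted = 0;      /* analogue of rb->size: live chunk bytes only */
    long    allocated = 0;      /* analogue of MemoryContextMemAllocated()     */
    int     b;

    /* Fill every block; one chunk per block belongs to a long-running txn. */
    for (b = 0; b < NBLOCKS; b++)
    {
        live_chunks[b] = CHUNKS_PER_BLOCK;
        allocated += BLOCK_SIZE;
    }

    /* Short transactions finish: only the long-running txn's chunks remain. */
    for (b = 0; b < NBLOCKS; b++)
        live_chunks[b] = 1;

    /* A block can be released only once every chunk in it is freed. */
    for (b = 0; b < NBLOCKS; b++)
    {
        accounted += live_chunks[b] * CHUNK_SIZE;
        if (live_chunks[b] == 0)
            allocated -= BLOCK_SIZE;
    }

    printf("accounted (rb->size analogue): %ld bytes\n", accounted);
    printf("still allocated in blocks:     %ld bytes\n", allocated);
    return 0;
}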
>
> We can reproduce this issue with the attached patch
> rb_excessive_memory_reproducer.patch (not intended for commit), which
> adds memory usage reporting and includes a test. After running the TAP
> test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
> usage report in the server logs like the following:
>
> LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
> rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
>
> This means that logical decoding allocated about 1GB of memory even
> though logical_decoding_work_mem is 256MB.
>
> One idea to deal with this problem is to use
> MemoryContextMemAllocated() as the reorderbuffer's memory usage
> instead of txn->total_size. That is, we evict transactions until the
> value returned by MemoryContextMemAllocated() drops below
> logical_decoding_work_mem. However, it could end up evicting a lot of
> (possibly all) transactions, because the transaction whose decoded
> tuple data is spread across memory context blocks could be evicted
> only after all other, larger transactions have been evicted (note that
> we evict transactions from largest to smallest). Therefore, if we want
> to do that, we would need to change the eviction algorithm to, for
> example, LSN order instead of transaction-size order. Furthermore, the
> reorderbuffer changes that are counted in txn->size (and therefore
> rb->size too) are stored in different memory contexts depending on
> their kind. For example, decoded tuples are stored in rb->tup_context,
> ReorderBufferChange structs are stored in rb->change_context, and
> snapshot data is stored in builder->context. So we would need to sort
> out which data is stored in which memory context and which memory
> contexts should be accounted for in the reorderbuffer's memory usage,
> which could be a large amount of work.
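
If I read this first idea correctly, the enforcement loop would look
roughly like the toy sketch below. This is not reorderbuffer code:
context_mem_allocated(), pick_next_txn(), and evict_txn() are
hypothetical stand-ins, and pick_next_txn() is where an LSN-ordered
policy would replace the current largest-first one:

#include <stddef.h>

typedef struct Txn Txn;

/* Hypothetical stand-ins for the real reorderbuffer facilities. */
extern size_t context_mem_allocated(void);  /* cf. MemoryContextMemAllocated() */
extern Txn   *pick_next_txn(void);          /* e.g. oldest start LSN first */
extern void   evict_txn(Txn *txn);          /* spill or stream its changes */

static void
enforce_memory_limit(size_t logical_decoding_work_mem)
{
    /* Evict until the actually allocated memory is under the limit. */
    while (context_mem_allocated() > logical_decoding_work_mem)
    {
        Txn *txn = pick_next_txn();

        if (txn == NULL)
            break;              /* nothing left to evict */
        evict_txn(txn);
    }
}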
>
> Another idea is to have a per-transaction memory context for storing
> decoded tuples. That way, we can ensure that all memory blocks of the
> context are freed when the transaction is evicted or completed. I
> think this idea would be simpler and worth considering, so I attached
> the PoC patch, use_tup_context_per_ctx_v1.patch. Since decoded tuple
> data is allocated individually, when the corresponding WAL record is
> decoded, but released collectively (e.g., at transaction eviction),
> the bump memory context would fit this case best. One exception is
> that we immediately free the decoded tuple data if the transaction is
> already aborted; however, in that case, I think it is possible to skip
> the WAL decoding in the first place.

I haven't read the patch yet, but it seems like a reasonable approach.
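
To make sure I understand the idea, here is a toy model of it in
standalone C. It is not the patch itself; the names ArenaBlock, Txn,
and txn_arena_* are invented for illustration. Each top-level
transaction gets its own bump-style arena, so evicting or finishing the
transaction releases every block at once, no matter how its allocations
were interleaved with other transactions:

#include <stdlib.h>

typedef struct ArenaBlock
{
    struct ArenaBlock *next;
    size_t      used;
    char        data[8192];     /* toy block size; oversized chunks and
                                 * alignment are not handled here */
} ArenaBlock;

typedef struct Txn
{
    ArenaBlock *blocks;         /* per-transaction tuple arena */
} Txn;

/* Bump-style allocation: append into the newest block and never free
 * individual chunks, matching the access pattern of decoded tuples. */
static void *
txn_arena_alloc(Txn *txn, size_t size)
{
    ArenaBlock *b = txn->blocks;

    if (b == NULL || b->used + size > sizeof(b->data))
    {
        b = calloc(1, sizeof(ArenaBlock));
        b->next = txn->blocks;
        txn->blocks = b;
    }

    void *p = b->data + b->used;

    b->used += size;
    return p;
}

/* Eviction or completion: the whole arena goes away in one pass. */
static void
txn_arena_release(Txn *txn)
{
    while (txn->blocks)
    {
        ArenaBlock *next = txn->blocks->next;

        free(txn->blocks);
        txn->blocks = next;
    }
}

int
main(void)
{
    Txn     long_txn = {0};
    Txn     short_txn = {0};
    int     i;

    /* Interleave allocations from a long-running and a short transaction. */
    for (i = 0; i < 1000; i++)
    {
        txn_arena_alloc(&long_txn, 100);
        txn_arena_alloc(&short_txn, 100);
    }

    /* Completing the short transaction frees all of its blocks right away;
     * nothing stays pinned by the long transaction's chunks. */
    txn_arena_release(&short_txn);
    txn_arena_release(&long_txn);
    return 0;
}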

> One thing we need to consider is that the number of transaction
> entries and memory contexts in the reorderbuffer could be very high,
> since it has entries for both top-level transactions and
> sub-transactions. To deal with that, the patch stores the decoded
> tuples of a subtransaction in its top-level transaction's tuple
> context. Since we always evict sub-transactions and their top-level
> transaction at the same time, I think this would not be a problem. We
> would still have to check for performance degradation due to using
> many memory contexts.

Yeah, and I imagine there would be cases where the current
implementation performs better, such as when there are no long-running
transactions, but compared to unexpected memory bloat and OOM kills, I
think the new approach is far better.
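
As for the sub-transaction handling, my reading is roughly the
following (an illustrative fragment with invented names, not the
patch): allocations for a subtransaction's decoded tuples are routed
to its top-level transaction's tuple context, so the number of
per-transaction contexts scales only with top-level transactions, and
the subtransaction's memory is released together with its top-level
transaction's.

typedef struct Txn
{
    struct Txn *toplevel;   /* NULL for a top-level transaction */
    void       *tuple_ctx;  /* per-top-level-transaction tuple context */
} Txn;

/* A subtransaction borrows its top-level transaction's tuple context. */
static void *
tuple_ctx_for(Txn *txn)
{
    Txn *owner = (txn->toplevel != NULL) ? txn->toplevel : txn;

    return owner->tuple_ctx;
}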

> Even with this idea, we would still have a mismatch between the actual
> amount of allocated memory and rb->size, but it would be much better
> than today. Something like the first idea would be necessary to fully
> address this mismatch, and we can discuss it in a separate thread.
>
> Regards,
>
> [1]
> https://www.postgresql.org/message-id/CAMnUB3oYugXCBLSkih%2BqNsWQPciEwos6g_AMbnz_peNoxfHwyw%40mail.gmail.com
> [2]
> https://www.postgresql.org/message-id/17974-f8c9d353a62f414d%40postgresql.org
> [3]
> https://www.postgresql.org/message-id/DB9PR07MB71808AC6C7770AF2FB36B95BCB252%40DB9PR07MB7180.eurprd07.prod.outlook.com

Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
