From: | "Wei Wang (Fujitsu)" <wangw(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Alex Richman <alexrichman(at)onesignal(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Niels Stevens <niels(dot)stevens(at)onesignal(dot)com> |
Subject: | RE: Logical Replica ReorderBuffer Size Accounting Issues |
Date: | 2023-05-24 10:13:02 |
Message-ID: | OS3PR01MB627537D530E5632A5AF8921A9E419@OS3PR01MB6275.jpnprd01.prod.outlook.com |
Lists: pgsql-bugs
On Wed, May 24, 2023 at 9:27 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Tue, May 23, 2023 at 1:11 PM Wei Wang (Fujitsu)
> <wangw(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Thu, May 9, 2023 at 22:58 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > On Tue, May 9, 2023 at 6:06 PM Wei Wang (Fujitsu)
> > > > > I think there are two separate issues. One is a pure memory accounting
> > > > > issue: since the reorderbuffer accounts for memory usage by calculating
> > > > > the actual tuple size etc., it includes neither the chunk header size
> > > > > nor the fragmentation within blocks. So I can understand why the output
> > > > > of MemoryContextStats(rb->context) could be two or three times higher
> > > > > than logical_decoding_work_mem and doesn't match rb->size in some cases.
> > > > >
> > > > > However, it cannot explain the original issue, where the memory usage
> > > > > (reported by MemoryContextStats(rb->context)) reached 5GB in spite of
> > > > > logical_decoding_work_mem being 256MB, which looks like a memory leak
> > > > > or a case where we ignore the memory limit.
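
To put a rough number on this accounting gap, here is a small standalone C
sketch, not PostgreSQL code: the 8kB block size and 16-byte chunk header are
made-up example values, and it just compares the bytes that tuple-size-based
accounting would report with what a block-based allocator actually reserves:

#include <stdio.h>
#include <stddef.h>

#define BLOCK_SIZE   (8 * 1024)   /* assumed block size, illustration only */
#define CHUNK_HEADER 16           /* assumed per-chunk overhead, illustration only */

int main(void)
{
    size_t accounted = 0;   /* what tuple-size-based accounting would add up */
    size_t block_used = 0;  /* bytes consumed in the current block */
    size_t n_blocks = 1;
    int    i;

    /* Allocate 10,000 "tuples" of 100 bytes each. */
    for (i = 0; i < 10000; i++)
    {
        size_t tuple_size = 100;
        size_t chunk_size = tuple_size + CHUNK_HEADER;

        accounted += tuple_size;                /* the chunk header is not accounted */
        if (block_used + chunk_size > BLOCK_SIZE)
        {
            n_blocks++;                         /* the leftover tail of the block is wasted */
            block_used = 0;
        }
        block_used += chunk_size;
    }

    printf("accounted: %zu bytes, actually allocated: %zu bytes in %zu blocks\n",
           accounted, n_blocks * BLOCK_SIZE, n_blocks);
    return 0;
}

With the numbers assumed above it reports roughly 1.0MB accounted versus about
1.17MB actually allocated, i.e. the kind of constant-factor gap described here,
not the 256MB-vs-5GB gap of the original report.
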
> > > >
> > > > Yes, I agree that the chunk header size or fragmentation within blocks may
> > > > cause the allocated space to be larger than the accounted space. However, since
> > > > these overheads are very small (please refer to [1] and [2]), I also don't think
> > > > they are the cause of the original issue in this thread.
> > > >
> > > > I think that the cause of the original issue in this thread is the
> > > > implementation of the 'Generational allocator'.
> > > > Please consider the following scenario:
> > > > Parallel execution of different transactions produces very fragmented and
> > > > interleaved WAL records for those transactions. Later, when walsender serially
> > > > decodes the WAL, chunks belonging to different transactions end up on a single
> > > > block in rb->tup_context. However, when a transaction ends, the chunks related
> > > > to that transaction on the block are only marked as free instead of being
> > > > actually released. The block is released only when all chunks in it are free,
> > > > in other words, only when all transactions occupying the block have ended. As a
> > > > result, chunks allocated by transactions that have already ended stay on many
> > > > blocks for a long time, and this issue occurs. I think this also explains why
> > > > parallel execution is more likely to trigger this issue than serial execution
> > > > of transactions. Please also refer to the detailed code analysis in [3].
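
To illustrate this scenario with a toy model (standalone C, not PostgreSQL code;
the number of transactions, chunks per block, and block count below are made-up
values), the following sketch mimics the rule that a block can only be returned
once every chunk on it has been freed. Because the transactions' chunks are
interleaved, no block can be released until the last of its owning transactions
has ended:

#include <stdio.h>

#define N_TXNS           8    /* hypothetical number of interleaved transactions */
#define CHUNKS_PER_BLOCK 64   /* hypothetical number of chunks per block */
#define N_BLOCKS         100

int main(void)
{
    int live[N_BLOCKS] = {0};            /* still-allocated chunks per block */
    int owner[N_BLOCKS][N_TXNS] = {{0}}; /* chunks each transaction owns per block */
    int b, c, txn;

    /* Changes of the transactions arrive interleaved, so consecutive chunks
     * on the same block belong to different transactions. */
    for (b = 0; b < N_BLOCKS; b++)
        for (c = 0; c < CHUNKS_PER_BLOCK; c++)
        {
            txn = c % N_TXNS;
            owner[b][txn]++;
            live[b]++;
        }

    /* End the transactions one by one.  A block can only be returned once
     * every chunk on it is free, i.e. once all of its owners have ended. */
    for (txn = 0; txn < N_TXNS; txn++)
    {
        int released = 0;

        for (b = 0; b < N_BLOCKS; b++)
        {
            live[b] -= owner[b][txn];    /* chunks are only marked free */
            if (live[b] == 0)
                released++;
        }
        printf("after transaction %d ends: %d of %d blocks can be released\n",
               txn, released, N_BLOCKS);
    }
    return 0;
}

Every line it prints except the last one says "0 of 100 blocks can be released",
which is exactly the behavior described above: most chunks are already free, yet
almost no memory is returned.
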
> > >
> > > After some investigation, I don't think the implementation of the
> > > generation allocator is problematic, but I agree that your scenario
> > > likely explains the original issue. In particular, the output of
> > > MemoryContextStats() shows:
> > >
> > > Tuples: 4311744512 total in 514 blocks (12858943 chunks);
> > > 6771224 free (12855411 chunks); 4304973288 used
> > >
> > > First, since the total memory allocation was 4311744512 bytes in 514
> > > blocks, we can see there were no special blocks in the context (8MB *
> > > 514 = 4311744512 bytes). Second, it shows that most chunks were
> > > free (12855411 of 12858943 chunks) but most memory was used
> > > (4304973288 of 4311744512 bytes), which means that there were still
> > > some in-use chunks at the tail of each block, i.e. most blocks
> > > were fragmented. I've attached another test to reproduce this
> > > behavior. In this test, the memory usage reaches almost 4GB.
> > >
> > > One idea to deal with this issue is to choose the block sizes
> > > carefully while measuring the performance as the comment shows:
> > >
> > > /*
> > > * XXX the allocation sizes used below pre-date generation context's block
> > > * growing code. These values should likely be benchmarked and set to
> > > * more suitable values.
> > > */
> > > buffer->tup_context = GenerationContextCreate(new_ctx,
> > > "Tuples",
> > > SLAB_LARGE_BLOCK_SIZE,
> > > SLAB_LARGE_BLOCK_SIZE,
> > > SLAB_LARGE_BLOCK_SIZE);
> > >
> > > For example, if I use SLAB_DEFAULT_BLOCK_SIZE (8kB), the maximum memory
> > > usage in the test was about 17MB.
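
For reference, the variant described above would look roughly like this, i.e.
the quoted call with the 8kB default block size substituted (a sketch of the
experiment, not a proposed patch; which values are right is exactly what the
XXX comment says still needs benchmarking):

/* Use the 8kB SLAB_DEFAULT_BLOCK_SIZE instead of SLAB_LARGE_BLOCK_SIZE (8MB). */
buffer->tup_context = GenerationContextCreate(new_ctx,
                                              "Tuples",
                                              SLAB_DEFAULT_BLOCK_SIZE,
                                              SLAB_DEFAULT_BLOCK_SIZE,
                                              SLAB_DEFAULT_BLOCK_SIZE);
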
> >
> > Thanks for your idea.
> > I did some tests as you suggested. I think the modification mentioned above can
> > work around this issue in the test 002_rb_memory_2.pl on [1] (to make the
> > transactions count as large ones, I set logical_decoding_work_mem to 1MB). But
> > the test repreduce.sh on [2] still reproduces this issue.
>
> Yes, it's because the above modification doesn't fix the memory
> accounting issue but only reduces memory bloat in some (extremely bad)
> cases. Without this modification, it was possible for the maximum
> actual memory usage to easily reach several tens of times
> logical_decoding_work_mem (e.g. 4GB vs. 256MB as originally reported).
> Since the reorderbuffer still doesn't account for memory fragmentation
> etc., it's still possible for the actual memory usage to reach several
> times logical_decoding_work_mem. In my environment, with the
> reproducer.sh you shared, the total actual memory usage reached about
> 430MB while logical_decoding_work_mem was 256MB. Probably even if we
> use another type of memory allocator such as AllocSet, a similar issue
> will still happen. If we want the reorderbuffer memory usage to never
> exceed logical_decoding_work_mem, we would need to change how the
> reorderbuffer uses and accounts for memory, which would require a lot
> of work, I guess.
>
> > It seems that this modification
> > fixes a subset of use cases, but the issue still occurs for others.
> >
> > I think that the size of a block may change the number of transactions whose
> > changes are stored on the block. For example, before the modification, a block
> > could store some changes of 10 transactions, but after the modification, a block
> > may only store some changes of 3 transactions. I think this means that once
> > those three transactions are committed, the block will actually be released.
> > As a result, the probability of the block actually being released is higher
> > after the modification.
>
> In addition to that, I think the size of a block may also change the
> consequences of memory fragmentation. The larger the blocks, the more
> fragmentation can build up within them.
>
> > Additionally, I think that the parallelism of the test
> > repreduce.sh is higher than that of the test 002_rb_memory_2.pl, which is also
> > the reason why this modification only fixed the issue in the test
> > 002_rb_memory_2.pl.
>
> Could you elaborate on why higher parallelism could affect this memory
> accounting issue more?
I think higher parallelism leads to greater fragmentation and interleaving of the WAL.
This means that walsender is processing more transactions at the same time. IOW,
although the number of changes stored on a block has not changed, the number of
transactions these changes belong to has increased. So, in order to actually
release the block, the number of transactions that need to be committed also
increases. What's more, I think the freed chunks in the block cannot be reused
(see the functions GenerationAlloc and GenerationFree). As a result, in some
cases we need to wait longer before the block can actually be released, and the
issue is more likely to be reproduced. Therefore, I think that higher parallelism
makes the issue on this thread more likely to be reproduced.
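
To make the effect of parallelism a bit more concrete, here is a toy calculation
(standalone C; the uniform-lifetime model is purely an assumption for
illustration, not a measurement): if a block holds chunks of K in-flight
transactions, it can only be released at the latest of their end times, so its
expected lifetime grows with K:

#include <stdio.h>

int main(void)
{
    /*
     * Toy model: transaction lifetimes are i.i.d. uniform on [0, T].  A block
     * holding chunks of K such transactions is released at the maximum of
     * their end times, whose expected value is T * K / (K + 1).
     */
    int K;

    for (K = 1; K <= 32; K *= 2)
        printf("K = %2d owning transactions -> expected block lifetime ~ %.2f * T\n",
               K, (double) K / (K + 1));
    return 0;
}

Higher parallelism increases K per block (especially with 8MB blocks), so blocks
stay pinned longer and the bloat becomes easier to reproduce.
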
Regards,
Wang wei