Re: Logical Replica ReorderBuffer Size Accounting Issues

From: Alex Richman <alexrichman(at)onesignal(dot)com>
To: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Niels Stevens <niels(dot)stevens(at)onesignal(dot)com>
Subject: Re: Logical Replica ReorderBuffer Size Accounting Issues
Date: 2023-01-11 15:41:30
Message-ID: CAMnUB3pGWcUL08fWB4QmO0+2yNBBckXq=ndyLoGAU+V_2WQaCg@mail.gmail.com
Lists: pgsql-bugs

On Tue, 10 Jan 2023 at 11:22, wangw(dot)fnst(at)fujitsu(dot)com
<wangw(dot)fnst(at)fujitsu(dot)com> wrote:

> In summary, with the commit c6e0fe1f2a in master, the additional space
> allocated in the context is reduced. But I think this size difference seems
> to be inconsistent with what you meet. So I think the issues you meet seems
> not to be caused by the problem improved by this commit on master. How do
> you think?
>
Agreed - I see a few different places where rb->size can disagree with
the allocation size, but nothing that would produce a delta of 200KB vs
7GiB. I think the issue lies somewhere within the allocator itself (more
below).

> If possible, could you please share which version of PG the issue occurs on,
> and could you please also try to reproduce the problem on master?
>
We run 15.1-1 in prod; I have been trying to reproduce the issue on that
version as well.

So far I have a partial reproduction of the issue: I populated a table with
schema (id UUID PRIMARY KEY, data JSONB) with some millions of rows, then ran
updates against it (16 of these concurrently, each acting on 1/16th of the
rows):

UPDATE test SET data = data || '{"test_0": "1", "test_1": "1", "test_2": "1",
  "test_3": "1", "test_4": "1", "test_5": "1", "test_6": "1", "test_7": "1",
  "test_8": "1", "test_9": "1", "test_a": "1", "test_b": "1", "test_c": "1",
  "test_d": "1", "test_e": "1", "test_f": "1"}' #- '{test_0}';
Running these updates does cause the walsender memory to grow to ~1GiB, even
with logical_decoding_work_mem set to 256MB. However, it is not a perfect
reproduction of the issue we see in prod: rb->size does eventually reach 256MB
and transactions start streaming, so the walsender memory never climbs to the
levels we see in prod.
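
(As an aside, a convenient way to watch the walsender's memory contexts,
including the Tuples generation context, is to dump them to the server log;
this needs superuser:

SELECT pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE backend_type = 'walsender';
)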

I believe the prod workload is a large number of single-row updates, which may
be the relevant difference, but I am having trouble generating enough update
volume on the test systems to simulate it. For some idea of the scale, those
update statements in prod are producing ~6 million WAL records per minute
according to pg_stat_statements.
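
(That number comes from the per-statement WAL counters in pg_stat_statements,
sampled over time - something like:

SELECT query, calls, wal_records, wal_bytes
FROM pg_stat_statements
ORDER BY wal_records DESC
LIMIT 10;
)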

To tie off the reorder buffer size question, I built a custom 15.1 from source
with a patch that passes the tuple's allocated size from
ReorderBufferGetTupleBuf through to ReorderBufferChangeSize, so that rb->size
is accounted against the allocated size rather than t_len. This had no notable
effect on rb->size relative to the process RSS, so I agree that the issue lies
deeper within the Generation memory context rather than in the reorder buffer
accounting.
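
(The accounting change was essentially along these lines in
ReorderBufferChangeSize - a sketch of the idea rather than the exact diff,
using the alloc_tuple_size that ReorderBufferGetTupleBuf already records:

    if (change->data.tp.newtuple)
    {
        sz += sizeof(ReorderBufferTupleBuf);
        /* account the allocated buffer size, not the tuple's t_len */
        sz += change->data.tp.newtuple->alloc_tuple_size;
    }

with the same change for oldtuple.)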

Suspicious of GenerationAlloc, I then patched it to log its decision making
(essentially the fprintf sketch below), and found that it was
disproportionately failing to allocate space within the freeblock + keeper
block, such that almost every Tuples context allocation was malloc'ing a new
block. I don't really understand the structure of the allocator, but based on
the logging, freeblock was consistently NULL and the keeper block had
GenerationBlockFreeBytes() = 0. I lack the familiarity with the postgres
codebase to investigate much further here. Incidentally, there is a comment [1]
which suggests the allocation sizes used for the Tuples generation context may
be out of date.
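
(The logging itself was nothing clever - roughly this, placed early in
GenerationAlloc(), where set is the GenerationContext that the function
already derives from context:

    fprintf(stderr,
            "GenerationAlloc: ctx=%s size=%zu freeblock=%p keeper_free=%zu\n",
            context->name, size,
            (void *) set->freeblock,
            GenerationBlockFreeBytes(set->keeper));
)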

Further to the memory allocation strangeness, I noticed that there is a lot of
heap fragmentation within the walsender processes. After the spike in the
Tuples memory context has returned to normal, the RSS of the process itself
remains at peak for ~10-20 minutes. Inspecting core dumps of these processes
with core_analyzer shows that the memory is actually free and is just not being
reclaimed because of heap fragmentation. Perhaps fragmentation within the
memory context, due to the allocation sizes vs the chunk size in
GenerationAlloc, is also a factor.

Thanks,
- Alex.

[1]
https://github.com/postgres/postgres/blob/c5dc80c1bc216f0e21a2f79f5e0415c2d4cfb35d/src/backend/replication/logical/reorderbuffer.c#L332
