Re: Logical Replica ReorderBuffer Size Accounting Issues

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
Cc: Alex Richman <alexrichman(at)onesignal(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Niels Stevens <niels(dot)stevens(at)onesignal(dot)com>
Subject: Re: Logical Replica ReorderBuffer Size Accounting Issues
Date: 2023-01-18 10:10:07
Message-ID: CAA4eK1KdwDrLW2tE-jawqEADy8w=6nHtEwMWeno12WnZ+xEByQ@mail.gmail.com
Lists: pgsql-bugs

On Fri, Jan 13, 2023 at 4:47 PM wangw(dot)fnst(at)fujitsu(dot)com
<wangw(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Thu, Jan 12, 2023 at 21:02 PM Alex Richman <alexrichman(at)onesignal(dot)com> wrote:
> > On Thu, 12 Jan 2023 at 10:44, wangw(dot)fnst(at)fujitsu(dot)com
> > <wangw(dot)fnst(at)fujitsu(dot)com> wrote:
> > > I think parallelism doesn't affect this problem, because a walsender will
> > > always read the WAL serially, in order. Please let me know if I'm missing
> > > something.
> > I suspect it's more about getting enough changes into the WAL quickly enough
> > for the walsender to not spend any time idle. I suppose you could stack the
> > deck towards this by first disabling the subscription, doing the updates to
> > spool a bunch of changes in the WAL, then enabling the subscription again.
> > Perhaps the WAL records interleaving from the concurrent updates also has
> > some impact, making more work for the reorder buffer.
> > The servers I am testing on are quite beefy, so it might be a little harder
> > to generate sufficient load if you're testing locally on a laptop or
> > something.
> >
> > > And I tried to use the table structure and UPDATE statement you described.
> > > But unfortunately I didn't catch 1GB or unexpectedly large (I mean a size
> > > far beyond 256MB) usage in rb->tup_context. Could you please help me to
> > > confirm my test? Here are my test details:
> > Here are test scripts that replicate it for me: [1]
> > This is on 15.1, installed on Debian 11, running on a GCP n2-highmem-80
> > (Ice Lake) with 24x local SSD in RAID 0.
>
> Thanks for the details you shared.
>
> Yes, I think you are right. I reproduced this problem as you suggested
> (updating the entire table in parallel), and I can reproduce it on both
> current HEAD and REL_15_1. The memory used in rb->tup_context can reach 350MB
> on HEAD and 600MB on REL_15_1.
>
> Here are my steps to reproduce:
> 1. Apply the attached diff patch to add some logs for confirmation.
> 2. Use the attached reproduction script to reproduce the problem.
> 3. Check the debug log output to the log file pub.log.
>
> After doing some research, I agree with the idea you mentioned before. I think
> this problem is caused by the implementation of the Generation allocator, or
> by the way we use its API.
>
> Here is my analysis:
> When we try to free the memory used in rb->tup_context in the function
> GenerationFree(), I think it is because of this if-condition [1] that the
> memory is not actually freed. So IIUC, in the function
> ReorderBufferReturnChange, rb->size is reduced by the function
> ReorderBufferChangeMemoryUpdate, while the memory used in rb->tup_context may
> not be freed by the function ReorderBufferReturnTupleBuf. I think this is why
> the two counters don't match.
>
> BTW, after debugging, I found that compared to updating the entire table
> serially, updating it in parallel makes this condition be met much more
> often. I think this is why updating the table in parallel makes this problem
> easier to reproduce.
>

Yeah, this theory sounds like the cause of the problem: normally we
free all the tuples at the end of the transaction, but in this case,
with parallel txns interleaved, the allocator may not get a chance to
free them. This should be less of a problem after commit 1b0d9aa4
(which is present in PG15), where we started to reuse the free chunks.
I wonder why you don't see that reuse in PG15. Is it because of large
tuples, something that causes direct allocations, or something else?
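To make the mismatch concrete, here is a small self-contained model I
put together (an illustration only, not actual PostgreSQL source; the
block and chunk counts are made up) of how rb->size can drop steadily
while the Generation context keeps all of its blocks, because a block
is only released once every chunk on it has been freed:

#include <stdio.h>

#define CHUNKS_PER_BLOCK 64
#define NBLOCKS          4

int
main(void)
{
    int  nfree[NBLOCKS] = {0};
    long rb_size  = NBLOCKS * CHUNKS_PER_BLOCK;  /* logical accounting (rb->size) */
    long retained = NBLOCKS * CHUNKS_PER_BLOCK;  /* memory actually held in blocks */

    /*
     * Interleaved transactions free every chunk except one per block:
     * the logical counter shrinks on each free, but no block ever
     * empties completely, so the context releases nothing.
     */
    for (int c = 0; c < CHUNKS_PER_BLOCK - 1; c++)
        for (int b = 0; b < NBLOCKS; b++)
        {
            rb_size--;                          /* ReorderBufferChangeMemoryUpdate */
            if (++nfree[b] == CHUNKS_PER_BLOCK) /* the GenerationFree() condition */
                retained -= CHUNKS_PER_BLOCK;
        }

    /* Prints: rb->size thinks 4 chunks remain; context still holds 256. */
    printf("rb->size thinks %ld chunks remain; context still holds %ld\n",
           rb_size, retained);
    return 0;
}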

Alex,
Do we see this problem with small tuples as well? I see from your
earlier email that the tuple size is ~800 bytes in the production
environment. It is possible that after commit 1b0d9aa4 this kind of
problem does not occur with small tuple sizes, but that commit went
into PG15, whereas your production environment might be on a prior
release.
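
For reference, my understanding of the allocation paths involved, as a
rough sketch (not the exact generation.c logic; the limit value below
is invented for illustration): chunks above the allocator's chunk-size
limit get their own dedicated block and bypass any reuse of freed
space, which is why large tuples could behave differently:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Invented limit, for illustration only. */
#define CHUNK_LIMIT 8192

typedef enum
{
    REUSE_FREED_SPACE,  /* small chunk served from previously freed space */
    CURRENT_BLOCK,      /* small chunk appended to the current block */
    DEDICATED_BLOCK     /* oversized chunk gets its own block: no reuse */
} AllocPath;

/*
 * Sketch of the decision: only small chunks can benefit from the
 * free-chunk reuse that commit 1b0d9aa4 brought in.
 */
static AllocPath
alloc_path(size_t size, bool freed_space_fits)
{
    if (size > CHUNK_LIMIT)
        return DEDICATED_BLOCK;
    return freed_space_fits ? REUSE_FREED_SPACE : CURRENT_BLOCK;
}

int
main(void)
{
    printf("~800-byte tuple -> path %d\n", alloc_path(800, true));
    printf("oversized tuple -> path %d\n", alloc_path(64 * 1024, true));
    return 0;
}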

--
With Regards,
Amit Kapila.
