Re: Using per-transaction memory contexts for storing decoded tuples

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using per-transaction memory contexts for storing decoded tuples
Date: 2024-10-01 19:58:03
Message-ID: CAD21AoC8wjuSPdTyhQ9y6JQyemLVNv5XiUX=eufFnL_P6X02vQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 1, 2024 at 5:15 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Sep 27, 2024 at 10:24 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Fri, Sep 27, 2024 at 12:39 AM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> > >
> > > On Mon, 23 Sept 2024 at 09:59, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> > > > > > > In general, it's a bit annoying to have to code around this
> > > > > > > GenerationContext fragmentation issue.
> > > > > >
> > > > > > Right, and I am also slightly afraid that this may cause some
> > > > > > regression in other cases where defrag wouldn't help.
> > > > >
> > > > > Yeah, that's certainly a possibility. I was hoping that
> > > > > MemoryContextMemAllocated() being much larger than logical_work_mem
> > > > > could only happen when there is fragmentation, but certainly, you
> > > > > could be wasting effort trying to defrag transactions where the
> > > > > changes all arrive in WAL consecutively and there is no
> > > > > fragmentation. It might be some other large transaction that's
> > > > > causing the context's allocations to be fragmented. I don't have any
> > > > > good ideas on how to avoid wasting effort on non-problematic
> > > > > transactions. Maybe there's something that could be done if we knew
> > > > > the LSN of the first and last change and the gap between the LSNs was
> > > > > much larger than the WAL space used for this transaction. That would
> > > > > likely require tracking way more stuff than we do now, however.
> > > > >
> > > >
> > > > With more information tracking, we could avoid some non-problematic
> > > > transactions but still, it would be difficult to predict that we
> > > > didn't harm many cases because to make the memory non-contiguous, we
> > > > only need a few interleaving small transactions. We can try to think
> > > > of ideas for implementing defragmentation in our code if we first can
> > > > prove that smaller block sizes cause problems.
> > > >
> > > > > With the smaller blocks idea, I'm a bit concerned that using smaller
> > > > > blocks could cause regressions on systems that are better at releasing
> > > > > memory back to the OS after free(), since malloc() would no doubt often
> > > > > be slower on those systems. There have been some complaints recently
> > > > > about glibc being a bit too happy to keep hold of memory after free(),
> > > > > and I wondered if that was the reason why the small-block test did not
> > > > > cause much of a performance regression. I wonder how the small-block
> > > > > test would look on Mac, FreeBSD, or Windows. I think it would be
> > > > > risky to assume that all is well with reducing the block size after
> > > > > testing on a single platform.
> > > > >
> > > >
> > > > Good point. We need extensive testing on different platforms, as you
> > > > suggest, to verify whether smaller block sizes cause any regressions.
> > >
> > > I ran similar tests on Windows, varying rb_mem_block_size from 8kB
> > > to 8MB. The table below shows the average time and the standard
> > > deviation of 5 runs for each block size.
> > >
> > > ==========================================================
> > > block size | Average time (ms) | Standard Deviation (ms)
> > > ----------------------------------------------------------
> > > 8kB        | 12580.879         | 144.6923467
> > > 16kB       | 12442.7256        | 94.02799006
> > > 32kB       | 12370.7292        | 97.7958552
> > > 64kB       | 11877.4888        | 222.2419142
> > > 128kB      | 11828.8568        | 129.732941
> > > 256kB      | 11801.086         | 20.60030913
> > > 512kB      | 12361.4172        | 65.27390105
> > > 1MB        | 12343.3732        | 80.84427202
> > > 2MB        | 12357.675         | 79.40017604
> > > 4MB        | 12395.8364        | 76.78273689
> > > 8MB        | 11712.8862        | 50.74323039
> > > ==========================================================
> > >
> > > From the results, I think there is a small regression for small block sizes.
> > >
> > > I ran the tests in Git Bash. I have also attached the test script.
> >
> > Thank you for testing on Windows! I've run the same benchmark on Mac
> > (Sonoma 14.7, M1 Pro):
> >
> > 8kB: 4852.198 ms
> > 16kB: 4822.733 ms
> > 32kB: 4776.776 ms
> > 64kB: 4851.433 ms
> > 128kB: 4804.821 ms
> > 256kB: 4781.778 ms
> > 512kB: 4776.486 ms
> > 1MB: 4783.456 ms
> > 2MB: 4770.671 ms
> > 4MB: 4785.800 ms
> > 8MB: 4747.447 ms
> >
> > I can see there is a small regression for small block sizes.
> >
>
> So, decoding a large transaction with many smaller allocations can
> have ~2.2% overhead with a smaller block size (say 8kB vs 8MB; on the
> Mac results above, (4852.198 - 4747.447) / 4747.447 is about 2.2%). In
> real workloads, we will have fewer such large transactions, or a mix
> of small and large transactions, which will make the overhead much
> less visible. Does this mean that we should invent some strategy to
> defrag the memory at some point during decoding, or use some other
> technique? I don't find this overhead to be above the threshold that
> would justify inventing something fancy. What do others think?
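
For reference, the LSN-gap heuristic David described upthread would
amount to something like the following hypothetical sketch. The function
name and the threshold are purely illustrative and not actual PostgreSQL
code, though first_lsn, final_lsn, and size are existing ReorderBufferTXN
fields:

/* assumes replication/reorderbuffer.h and access/xlogdefs.h */

/*
 * Hypothetical sketch: guess whether a transaction's decoded changes
 * are likely fragmented in the generation context.  A transaction whose
 * changes span much more WAL than the bytes they occupy in memory was
 * probably interleaved with other transactions.
 */
static bool
ReorderBufferTXNLooksFragmented(ReorderBufferTXN *txn)
{
	uint64		wal_span;

	if (XLogRecPtrIsInvalid(txn->first_lsn) ||
		XLogRecPtrIsInvalid(txn->final_lsn))
		return false;

	wal_span = (uint64) (txn->final_lsn - txn->first_lsn);

	/* illustrative threshold: WAL span more than 4x the decoded size */
	return wal_span > (uint64) txn->size * 4;
}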

I agree that the overhead will be much less visible in real workloads.
+1 for using a smaller block size (i.e., 8kB). It would be easy to
backpatch to old branches (if we agree to do so) and to revert the
change in case something goes wrong.
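
Concretely, the change is small. A rough sketch, assuming the
five-argument GenerationContextCreate() API in recent branches
(SLAB_DEFAULT_BLOCK_SIZE and SLAB_LARGE_BLOCK_SIZE are the existing
8kB and 8MB constants in memutils.h; older branches have a
three-argument GenerationContextCreate(), so a backpatch would adjust
the call accordingly):

/*
 * In ReorderBufferAllocate()
 * (src/backend/replication/logical/reorderbuffer.c): create the
 * decoded-tuple generation context with 8kB blocks instead of 8MB.
 */
	buffer->tup_context = GenerationContextCreate(new_ctx,
												  "Tuples",
												  SLAB_DEFAULT_BLOCK_SIZE, /* was SLAB_LARGE_BLOCK_SIZE */
												  SLAB_DEFAULT_BLOCK_SIZE,
												  SLAB_DEFAULT_BLOCK_SIZE);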

BTW, I've read the discussions around inventing the generational memory
context [1][2], but I could not find any discussion of the memory block
sizes. It seems we have used 8MB memory blocks since the first patch.

[1] https://www.postgresql.org/message-id/20160706185502.1426.28143%40wrigleys.postgresql.org
[2] https://www.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
