Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-03-17 07:49:33
Message-ID: CAD21AoCdnyT+8Zah7JwNsAjebUw65gpG3tqdBHEi7p7rpWZVig@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > >
> > > I wrote:
> > >
> > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
>
> > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
>
> > IIUC you suggested measuring memory usage by tracking how much memory
> > chunks are allocated within a block. If your idea at the top of the
> > page follows this method, it still doesn't deal with the point Andres
> > mentioned.
>
> Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.

Right. I still like your re-ordering idea. It's true that most of the
last block allocated before heap scanning stops is not actually used
yet. I'm guessing we can just check whether the context's memory usage
has gone over the limit. But I'm concerned it might not work well on
systems where memory overcommit is disabled.
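
To make the re-ordering concrete, here is a minimal sketch of the loop
order being discussed. Everything in it is hypothetical: ToyStore,
toy_add() and scan_pages() are toy stand-ins I made up for
illustration, not the tidstore or vacuum code. The store grows in
whole blocks, and the loop checks and "reports" block-level usage
before adding the current page's TIDs, so the reported value never
exceeds the limit.

```c
#include <stddef.h>

/* Hypothetical toy store that grows in fixed-size blocks. */
typedef struct
{
    size_t block_size;  /* bytes added per newly allocated block */
    size_t allocated;   /* block-level usage (what gets reported) */
    size_t used;        /* bytes actually written to */
} ToyStore;

/* Append nbytes, allocating new blocks as needed. */
static void
toy_add(ToyStore *s, size_t nbytes)
{
    s->used += nbytes;
    while (s->used > s->allocated)
        s->allocated += s->block_size;
}

/*
 * Scan up to npages pages. At the top of each iteration, first check
 * the limit, then report usage, and only then add this page's TIDs.
 * Returns the largest usage value that was ever reported.
 */
static size_t
scan_pages(ToyStore *s, size_t limit, int npages,
           size_t bytes_per_page, int *pages_done)
{
    size_t max_reported = 0;
    int    page;

    for (page = 0; page < npages; page++)
    {
        /* Over the limit: stop here (vacuum would go process indexes). */
        if (s->allocated > limit)
            break;

        /* Progress report reflects the state after the previous page. */
        if (s->allocated > max_reported)
            max_reported = s->allocated;

        toy_add(s, bytes_per_page);
    }

    *pages_done = page;
    return max_reported;
}
```

With 1kB blocks, a 4kB limit and 300 bytes per page, the block that
pushes usage over the limit is detected at the top of the next
iteration, and that over-limit value is never reported.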

>
> However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
>
> I'm starting to wonder if we need to give DSA a bit more info at the start. Imagine a "soft" limit given to the DSA area when it is initialized. If the total segment usage exceeds this, it stops doubling and instead new segments get smaller. Modifying an example we used for the fudge-factor idea some time ago:
>
> m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA area.
>
> 2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so "stairstep down" the new segment sizes:
>
> 766 + 2*(128) + 64 = 1086MB -> stop
>
> That's just an undeveloped idea, however, so likely v17 development, even assuming it's not a bad idea (could be).

This is an interesting idea. But I'm concerned we don't have enough
time to become confident about adding this new concept to DSA.
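
For reference, the arithmetic in the quoted example can be reproduced
with a small model. This is only a sketch of my reading of the
proposal, not DSA code: simulate_soft_limit() and SimResult are
hypothetical names, sizes are in MB, and I assume two segments are
created at each size (as the 2*(...) terms in the example imply),
that the size is halved as soon as the total exceeds the soft limit,
and that allocation stops once the total exceeds the hard limit.

```c
#define SEGMENTS_PER_SIZE 2     /* assumed: two segments at each size */

typedef struct
{
    long soft_hit_total;    /* total (MB) when the soft limit was crossed */
    long final_total;       /* total (MB) when allocation stopped */
    int  nsegments;         /* number of segments allocated */
} SimResult;

static SimResult
simulate_soft_limit(long soft_limit_mb, long hard_limit_mb)
{
    SimResult r = {0, 0, 0};
    long size = 1;          /* first segment is 1MB */
    long total = 0;
    int  at_this_size = 0;
    int  shrinking = 0;

    while (total <= hard_limit_mb && size >= 1)
    {
        total += size;
        r.nsegments++;
        at_this_size++;

        if (!shrinking && total > soft_limit_mb)
        {
            /* Soft limit crossed: start stepping segment sizes down. */
            shrinking = 1;
            r.soft_hit_total = total;
            size /= 2;
            at_this_size = 0;
        }
        else if (at_this_size == SEGMENTS_PER_SIZE)
        {
            /* Double while growing, halve once past the soft limit. */
            size = shrinking ? size / 2 : size * 2;
            at_this_size = 0;
        }
    }

    r.final_total = total;
    return r;
}
```

Under those assumptions, simulate_soft_limit(512, 1024) crosses the
soft limit at 766MB and stops at 1086MB, matching the numbers above.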

>
> And sadly, unless we find some other, simpler answer soon for tracking and limiting shared memory, the tid store is looking like v17 material.

Another problem we need to deal with is the minimum supported memory
in the shared tidstore case. Since the initial DSA segment size is
1MB, the memory usage of a shared tidstore starts at 1MB+. This is
higher than the minimum values of both work_mem and
maintenance_work_mem, 64kB and 1MB respectively. Increasing the
minimum m_w_m to 2MB seems to be acceptable to the community, but not
for work_mem. One idea is to reject memory limits below 2MB, so that
the tidstore won't work with small m_w_m settings. While that might be
an acceptable restriction at this stage (there is no use case for
tidstore with work_mem in core), it will be a blocker for future
adoption, such as unifying with tidbitmap.c. Another idea is to let
the caller specify the initial segment size at dsa_create() so that
DSA can start with a smaller segment, say 32kB. That way, a tidstore
with a 32kB limit becomes full once it allocates the next DSA segment
(32kB). A downside of this idea is that it increases the number of
segments behind DSA. Assuming it's relatively rare to use such a low
work_mem, that might be acceptable. FYI, the total number of DSM
segments available on the system is calculated by:

#define PG_DYNSHMEM_FIXED_SLOTS 64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND 5

maxitems = PG_DYNSHMEM_FIXED_SLOTS
+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
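
To get a feel for the numbers, here is a rough sketch. The two
constants are the ones shown above; dsm_max_segments() and
segments_to_reach() are hypothetical helpers for illustration only.
The second one assumes DSA creates two segments at each size before
doubling, and counts how many segments are needed to reach a given
total from a given initial segment size.

```c
#define PG_DYNSHMEM_FIXED_SLOTS 64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND 5

/* System-wide DSM segment slots for a given MaxBackends value. */
static int
dsm_max_segments(int max_backends)
{
    return PG_DYNSHMEM_FIXED_SLOTS
        + PG_DYNSHMEM_SLOTS_PER_BACKEND * max_backends;
}

/*
 * Number of segments needed for the total size to reach target_kb,
 * starting at initial_kb and doubling after every second segment
 * (assumed growth pattern).
 */
static int
segments_to_reach(long initial_kb, long target_kb)
{
    long size = initial_kb;
    long total = 0;
    int  n = 0;

    while (total < target_kb)
    {
        total += size;
        n++;
        if (n % 2 == 0)
            size *= 2;
    }
    return n;
}
```

For example, with MaxBackends = 100 there are 564 slots. Under the
assumed growth pattern, reaching 1MB takes one segment when starting
at 1MB but nine segments when starting at 32kB.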

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
