Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

From: Andres Freund <andres(at)anarazel(dot)de>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Date: 2023-02-17 17:52:01
Message-ID: 20230217175201.lspbj47wnjxhastu@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-02-17 17:26:20 +1300, David Rowley wrote:
> I didn't hear it mentioned explicitly here, but I suspect it's faster
> when increasing the initial size due to the memory context caching
> code that reuses aset MemoryContexts (see context_freelists[] in
> aset.c). Since we reset the context before caching it, then it'll
> remain fast when we can reuse a context, provided we don't need to do
> a malloc for an additional block beyond the initial block that's kept
> in the cache.

I'm not so sure this is the case, which is one of the reasons I'd really like
to see a) memory context stats for the executor context, b) a CPU profile of
the problem, and c) a reproducer.

Jonah, did you just increase the initial size, or did you potentially also
increase the maximum block size?

And did you increase ALLOCSET_DEFAULT_INITSIZE everywhere, or just pass a
larger block size in CreateExecutorState()? If the latter, the context
freelist wouldn't even come into play.

An 8MB max block size is pretty darn small if you have a workload that ends up
with gigabytes' worth of blocks.

And the problem also could just be that the default initial block size takes
too long to ramp up to a reasonable block size. I think it's 20 blocks to get
from ALLOCSET_DEFAULT_INITSIZE to ALLOCSET_DEFAULT_MAXSIZE. Even if you
allocate a good bit more than 8MB, having to additionally go through 20
smaller blocks is going to be noticeable until you reach a good bit higher
number of blocks.
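
As a toy illustration of the ramp (this models it as pure doubling; the real
aset.c behavior, keeper block handling etc. change the exact count):

#include <stdio.h>

int
main(void)
{
    size_t      size = 8 * 1024;            /* ALLOCSET_DEFAULT_INITSIZE */
    size_t      max = 8 * 1024 * 1024;      /* ALLOCSET_DEFAULT_MAXSIZE */
    size_t      total = 0;
    int         nblocks = 0;

    /* double the block size on each new block until we hit the max */
    while (size < max)
    {
        total += size;
        nblocks++;
        size *= 2;
    }
    printf("%d blocks / %zu kB allocated before reaching the max\n",
           nblocks, total / 1024);
    return 0;
}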

> Maybe we should think of a more general-purpose way of doing this
> caching which just keeps a global-to-the-process dclist of blocks
> laying around. We could see if we have any free blocks both when
> creating the context and also when we need to allocate another block.

Not so sure about that. I suspect the problem could just as well be the
maximum block size, leading to too many blocks being allocated. Perhaps we
should scale that to a certain fraction of work_mem, by default?
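
Something in this direction, purely as a sketch (the 1/16 fraction and the
rounding policy are made up; work_mem is measured in kB):

static Size
executor_max_block_size(void)
{
    Size        scaled = (Size) work_mem * 1024 / 16;
    Size        maxsize = ALLOCSET_DEFAULT_MAXSIZE;

    /* round down to a power of two so the doubling ramp still lines up */
    while (maxsize * 2 <= scaled)
        maxsize *= 2;

    return maxsize;     /* never below the existing default */
}

and then pass that as maxBlockSize when creating the context in
CreateExecutorState(), instead of using ALLOCSET_DEFAULT_SIZES wholesale.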

Either way, I don't think we should go too deep without some data, too likely
to miss the actual problem.

> I see no reason why this couldn't be shared among the other context
> types rather than keeping this cache stuff specific to aset.c. slab.c
> might need to be pickier if the size isn't exactly what it needs, but
> generation.c should be able to make use of it the same as aset.c
> could. I'm unsure what we'd need in the way of size classing for
> this, but I suspect we'd need to pay attention to that rather than do
> things like hand over 16MBs of memory to some context that only wants
> a 1KB initial block.

Possible. I can see something like a generic "free block" allocator being
useful. Potentially allocating the underlying memory with mmap()s larger than
what we need for individual blocks.
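
Roughly along these lines (self-contained sketch with made-up names and a
single size class; a real version would presumably use the dclist you mention
and handle multiple size classes):

#include <stdlib.h>

#define BLOCK_CACHE_BLOCKSZ     (1024 * 1024)   /* single size class */
#define BLOCK_CACHE_MAXBLOCKS   32              /* cap on retained blocks */

typedef struct CachedBlock
{
    struct CachedBlock *next;
} CachedBlock;

static CachedBlock *block_cache = NULL;
static int  block_cache_nblocks = 0;

static void *
block_cache_get(void)
{
    if (block_cache != NULL)
    {
        CachedBlock *block = block_cache;

        block_cache = block->next;
        block_cache_nblocks--;
        return block;
    }

    /* cache empty; could carve this out of a larger mmap() instead */
    return malloc(BLOCK_CACHE_BLOCKSZ);
}

static void
block_cache_put(void *p)
{
    CachedBlock *block = (CachedBlock *) p;

    if (block_cache_nblocks >= BLOCK_CACHE_MAXBLOCKS)
    {
        free(block);
        return;
    }
    block->next = block_cache;
    block_cache = block;
    block_cache_nblocks++;
}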

Random note:

I wonder if we should have a bitmap (in an int) in front of aset's
freelists. In a lot of cases we incur plenty of cache misses just to find the
freelist bucket empty.
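
I.e. something like this (simplified, not the actual aset.c structures):

#include <stdint.h>

#define ALLOCSET_NUM_FREELISTS 11

typedef struct FreelistChunk
{
    struct FreelistChunk *next;
} FreelistChunk;

typedef struct
{
    /* bit k set iff freelist[k] is non-empty */
    uint32_t    freelist_bitmap;
    FreelistChunk *freelist[ALLOCSET_NUM_FREELISTS];
} FreelistSketch;

static inline FreelistChunk *
freelist_pop(FreelistSketch *set, int fidx)
{
    FreelistChunk *chunk;

    /* the common empty-bucket case is answered from one int, no extra miss */
    if ((set->freelist_bitmap & (1u << fidx)) == 0)
        return NULL;

    chunk = set->freelist[fidx];
    set->freelist[fidx] = chunk->next;
    if (set->freelist[fidx] == NULL)
        set->freelist_bitmap &= ~(1u << fidx);
    return chunk;
}

static inline void
freelist_push(FreelistSketch *set, int fidx, FreelistChunk *chunk)
{
    chunk->next = set->freelist[fidx];
    set->freelist[fidx] = chunk;
    set->freelist_bitmap |= 1u << fidx;
}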

Greetings,

Andres Freund
