From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: slab allocator performance issues
Date: 2021-07-17 22:46:09
Message-ID: 02ff3ced-3212-aea1-c7a0-4829f43e58d2@enterprisedb.com
Lists: pgsql-hackers
On 7/17/21 11:14 PM, Andres Freund wrote:
> Hi,
>
> On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
>> On 7/17/21 9:43 PM, Andres Freund wrote:
>>> 1) If allocations are short-lived, slab.c can end up constantly
>>> freeing/initializing blocks. That requires fairly expensive iteration over
>>> all potential chunks in the block to initialize them, just to then free that
>>> memory again after a small number of allocations. The extreme case of this is
>>> when there are alternating phases of alloc/free of a single allocation.
>>>
>>> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
>>> problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
>>> only works if the problem is with the only, so it's not a great
>>> approach. Perhaps just keeping the last allocated block around would work?
>>>
>>
>> +1
>>
>> I think it makes perfect sense to not free the blocks immediately, and keep
>> one (or a small number) as a cache. I'm not sure why we decided not to have
>> a "keeper" block, but I suspect memory consumption was my main concern at
>> that point. But I never expected the cost to be this high.
>
> I think one free block might be too low in some cases. It's pretty
> common to have workloads where the number of allocations is "bursty",
> and it's imo one case where one might justifiably want to use a slab
> allocator... Perhaps a portion of a high watermark? Or a portion of the
> in use blocks?
>
I think a portion of the high watermark would be problematic for cases with
one huge transaction - that'll set a high watermark, and we'll keep way too
many free blocks. But a portion of the in-use blocks might work, I think.
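To make that a bit more concrete, here's a rough sketch of what I'd imagine
for the empty-block handling in SlabFree() (the freeblocks / nfreeblocks
fields and the 1/8 ratio are made up, just to illustrate the idea):

    /* chunk already returned, block already unlinked from slab->freelist[] */
    if (block->nfree == slab->chunksPerBlock)
    {
        if (slab->nfreeblocks * 8 <= slab->nblocks)
        {
            /* keep the empty block cached for the next SlabAlloc() */
            dlist_push_head(&slab->freeblocks, &block->node);
            slab->nfreeblocks++;
        }
        else
        {
            /* enough empty blocks cached already, give the memory back */
            slab->nblocks--;
            free(block);
        }
    }
    else
        dlist_push_head(&slab->freelist[block->nfree], &block->node);

SlabAlloc() would then look at slab->freeblocks before allocating a new
block, so the block initialization cost is only paid when the cache is empty.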
> Hm. I wonder if we should just not populate the freelist eagerly, to
> drive down the initialization cost. I.e. have a separate allocation path
> for chunks that have never been allocated, by having a
> SlabBlock->free_offset or such.
>
> Sure, it adds a branch to the allocation happy path, but it also makes the
> first allocation for a chunk cheaper, because there's no need to get the next
> element from the freelist (adding a likely cache miss). And it should make the
> allocation of a new block faster by a lot.
>
Not sure what you mean by 'not populate eagerly' so can't comment :-(
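But if the idea is something like the sketch below (a rough guess on my side,
with hypothetical free_offset / freehead / nextfree fields, so quite possibly
not what you have in mind), I can see how it would avoid walking all the
chunks when a new block is set up:

    /* SlabAlloc(), after picking a block with block->nfree > 0 */
    if (block->freehead != NULL)
    {
        /* reuse a chunk that has been allocated and freed before */
        chunk = block->freehead;
        block->freehead = chunk->nextfree;
    }
    else
    {
        /* never-used chunk - just carve the next one off the block */
        chunk = SlabBlockGetChunk(slab, block, block->free_offset);
        block->free_offset++;
    }
    block->nfree--;

That way the per-chunk setup cost is only paid when a chunk is handed out for
the first time, and a freshly allocated block needs almost no initialization.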
>
>>> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
>>> up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
>>> latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
>>> i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
>>> still slow enough there to be very worrisome.
>>>
>>> I don't see a way to get around the division while keeping the freelist
>>> structure as is. But:
>>>
>>> ISTM that we only need the index because of the free-chunk list, right? Why
>>> don't we make the chunk list use actual pointers? Is it a concern that that'd
>>> increase the minimum allocation size? If so, I see two ways around that:
>>> First, we could make the index just the offset from the start of the block,
>>> that's much cheaper to calculate. Second, we could store the next pointer in
>>> SlabChunk->slab/block instead (making it a union) - while on the freelist we
>>> don't need to dereference those, right?
>>>
>>> I suspect both would also make the block initialization a bit cheaper.
>>>
>>> That should also accelerate SlabBlockGetChunk(), which currently shows up as
>>> an imul, which isn't exactly fast either (and uses a lot of execution ports).
>>>
>>
>> Hmm, I think you're right we could simply use the pointers, but I have not
>> tried that.
>
> I quickly tried that, and it does seem to improve matters considerably. The
> block initialization still shows up as expensive, but not as bad. The div and
> imul are gone (except in an assertion build right now). The list manipulation
> still is visible.
>
Understood. I didn't expect this to be a full solution.
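Just to make sure we're talking about the same thing, I imagine the union you
describe would look roughly like this (the u / nextfree / freehead names are
made up, and I haven't thought about padding at all):

    typedef struct SlabChunk
    {
        union
        {
            SlabBlock  *block;          /* valid while the chunk is allocated */
            struct SlabChunk *nextfree; /* valid while on the block's freelist */
        }           u;
        SlabContext *slab;              /* owning context */
    } SlabChunk;

    /* SlabFree() - no division, just push the chunk onto the block's list */
    chunk->u.nextfree = block->freehead;
    block->freehead = chunk;
    block->nfree++;

    /* SlabAlloc() - pop it and restore the block pointer */
    chunk = block->freehead;
    block->freehead = chunk->u.nextfree;
    chunk->u.block = block;
    block->nfree--;

The chunk header stays at two pointers, so the minimum allocation size would
not change.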
>
>>> 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
>>> shows up prominently in profiles. That's ~tripling the number of cachelines
>>> touched in the happy path, with unpredictable accesses to boot.
>>>
>>> Perhaps we could reduce the precision of slab->freelist indexing to amortize
>>> that cost? I.e. not have one slab->freelist entry for each nfree, but instead
>>> have an upper limit on the number of freelists?
>>>
>>
>> Yeah. The purpose of organizing the freelists like this is to prioritize the
>> "more full" blocks" when allocating new chunks, in the hope that the "less
>> full" blocks will end up empty and freed faster.
>>
>> But this is naturally imprecise, and strongly depends on the workload, of
>> course, and I bet for most cases a less precise approach would work just as
>> well.
>>
>> I'm not sure how exactly the upper limit you propose would work, but perhaps
>> we could group the blocks by nfree ranges, say [0-15], [16-31] and so on.
>> So after the alloc/free we'd calculate the new freelist index as (nfree/16)
>> and only move the block if the index changed. This would reduce the
>> overhead to 1/16, and probably even more in practice.
>
>> Of course, we could also say we have e.g. 8 freelists and work the ranges
>> backwards from that, I guess that's what you propose.
>
> Yea, I was thinking something along those lines. As you say, how about there's
> always at most 8 freelists or such. During initialization we compute a shift
> that distributes chunksPerBlock from 0 to 8. Then we only need to perform list
> manipulation if block->nfree >> slab->freelist_shift != --block->nfree >> slab->freelist_shift.
>
> That seems nice from a memory usage POV as well - for small and frequent
> allocations needing chunksPerBlock freelists isn't great. The freelists are on
> a fair number of cachelines right now.
>
Agreed.
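Something like this, I suppose (freelist_shift would be a new field, and the
cap of 8 is arbitrary):

    /* SlabContextCreate(): pick a shift mapping nfree to roughly 8 buckets */
    slab->freelist_shift = 0;
    while ((slab->chunksPerBlock >> slab->freelist_shift) >= 8)
        slab->freelist_shift++;

    /* SlabAlloc()/SlabFree(): relink the block only if its bucket changes */
    oldbucket = block->nfree >> slab->freelist_shift;
    block->nfree++;             /* or block->nfree-- in SlabAlloc() */
    newbucket = block->nfree >> slab->freelist_shift;

    if (oldbucket != newbucket)
    {
        dlist_delete(&block->node);
        dlist_push_head(&slab->freelist[newbucket], &block->node);
    }

And the freelist array could then be a small fixed-size one, instead of
having chunksPerBlock + 1 entries.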
>
>>> 4) Less of a performance issue, and more of a usability one: the constant
>>> block size strikes me as problematic. Most users of an allocator sometimes
>>> deal with a small amount of data, and sometimes with a large amount.
>>>
>>
>> I doubt this is worth the effort, really. The constant block size makes
>> various places much simpler (both to code and reason about), so this should
>> not make a huge difference in performance. And IMHO the block size is mostly
>> an implementation detail, so I don't see that as a usability issue.
>
> Hm? It's something the user has to specify, so it's not really an
> implementation detail. It needs to be specified without sufficient
> information, as well, since externally one doesn't know how much memory the
> block header and chunk headers + rounding up will use, so computing a good
> block size isn't easy. I've wondered whether it should just be a count...
>
I think this is mixing two problems - how to specify the block size, and
whether the block size is constant (as in slab) or grows over time (as
in allocset).
As for specifying the block size, I agree that setting a chunk count and
deriving the block size in bytes from that might be easier to use, for the
reasons you mentioned.
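Roughly like this, I imagine (SlabContextCreateByCount is a made-up name,
just to illustrate; the "Change" context in reorderbuffer.c is what I have in
mind as a caller):

    /* the caller only says how many chunks it wants per block */
    buffer->change_context =
        SlabContextCreateByCount(new_ctx, "Change",
                                 sizeof(ReorderBufferChange),  /* chunk size */
                                 512);                         /* chunks per block */

    /* slab.c knows the header overhead, so it can derive the block size */
    blockSize = sizeof(SlabBlock) +
        chunksPerBlock * (sizeof(SlabChunk) + MAXALIGN(chunkSize));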
But growing the block size seems problematic for long-lived contexts
with workloads that change a lot over time - imagine e.g. decoding many
small transactions, with one huge transaction mixed in. The one huge
transaction will grow the block size, and we'll keep using it forever.
But in that case we might just as well allocate the large blocks from the
start, I guess.
> Why do you not think it's relevant for performance? Either one causes too much
> memory usage by using a too large block size, wasting memory, or one ends up
> losing perf through frequent allocations?
>
True. I simply would not expect this to make a huge difference - I may
be wrong, and I'm sure there are workloads where it matters. But I still
think it's easier to just use larger blocks than to make the slab code
more complex.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company