From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: slab allocator performance issues
Date: 2021-07-17 19:43:33
Message-ID: 20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
I just tried to use the slab allocator for a case where aset.c was
bloating memory usage substantially. First: It worked wonders for memory
usage, nearly eliminating overhead.
But it turned out to cause a *substantial* slowdown. With aset the
allocator is barely in the profile. With slab the profile is dominated
by allocator performance.
slab:
NOTICE: 00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION: bfm_test_insert_bulk, radix.c:2880
Overhead Command Shared Object Symbol
+ 28.27% postgres postgres [.] SlabAlloc
+ 9.64% postgres bdbench.so [.] bfm_delete
+ 9.03% postgres bdbench.so [.] bfm_set
+ 8.39% postgres bdbench.so [.] bfm_lookup
+ 8.36% postgres bdbench.so [.] bfm_set_leaf.constprop.0
+ 8.16% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
+ 6.88% postgres bdbench.so [.] bfm_delete_leaf
+ 3.24% postgres libc-2.31.so [.] _int_malloc
+ 2.58% postgres bdbench.so [.] bfm_tests
+ 2.33% postgres postgres [.] SlabFree
+ 1.29% postgres libc-2.31.so [.] _int_free
+ 1.09% postgres libc-2.31.so [.] unlink_chunk.constprop.0
aset:
NOTICE: 00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION: bfm_test_insert_bulk, radix.c:2880
+ 16.43% postgres bdbench.so [.] bfm_lookup
+ 15.38% postgres bdbench.so [.] bfm_delete
+ 12.82% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
+ 12.65% postgres bdbench.so [.] bfm_set
+ 12.15% postgres bdbench.so [.] bfm_set_leaf.constprop.0
+ 10.57% postgres bdbench.so [.] bfm_delete_leaf
+ 4.05% postgres bdbench.so [.] bfm_tests
+ 2.93% postgres [kernel.vmlinux] [k] clear_page_erms
+ 1.59% postgres postgres [.] AllocSetAlloc
+ 1.15% postgres bdbench.so [.] memmove(at)plt
+ 1.06% postgres bdbench.so [.] bfm_grow_leaf_16
OS:
NOTICE: 00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION: bfm_test_insert_bulk, radix.c:2880
That is somewhat surprising - part of the promise of a slab allocator is
that it's fast...
This is caused by multiple issues, I think, some of which seem fairly easy to
fix.
1) If allocations are short-lived, slab.c can end up constantly
freeing/initializing blocks. That requires fairly expensive iteration over
all potential chunks in the block to initialize them, only to free that
memory again after a small number of allocations. The extreme case of this is
when there are alternating phases of alloc/free of a single allocation.
I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
only works if the problem is with the only, so it's not a great
approach. Perhaps just keeping the last allocated block around would work?
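Sketch of the guard in SlabFree()'s block-release path (field names
approximate slab.c's; this is an illustration, not a proper patch):

/*
 * Sketch only: when the last chunk of a block is freed, only give the
 * block back if it isn't the slab's only block.  That way an alloc/free
 * ping-pong on a single allocation doesn't pay for re-initializing a
 * whole block on every cycle.
 */
if (block->nfree == slab->chunksPerBlock &&
    slab->nblocks > 1)          /* <-- the added condition */
{
    /* completely empty and not the last block: release it */
    dlist_delete(&block->node);
    slab->nblocks--;
    free(block);
}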
2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
up ~50% of the cycles in SlabFree(). A 64bit div, according to [1], has a
latency of 35-88 cycles on skylake-x (and a reciprocal throughput of 21-83,
i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
still slow enough there to be very worrisome.
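For reference, the computation is essentially the following (simplified from
slab.c's macros, not verbatim). Because fullChunkSize is only known at run
time, this is a real 64-bit integer division, not a shift the compiler could
substitute:

/* index of a chunk within its block: offset past the block header,
 * divided by the run-time chunk size */
Size    offset = (char *) chunk - ((char *) block + sizeof(SlabBlock));
int     idx = offset / slab->fullChunkSize;     /* the expensive 64-bit div */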
I don't see a way to get around the division while keeping the freelist
structure as is. But:
ISTM that we only need the index because of the free-chunk list, right? Why
don't we make the chunk list use actual pointers? Is the concern that that'd
increase the minimum allocation size? If so, I see two ways around that:
First, we could make the index just the offset from the start of the block,
which is much cheaper to calculate. Second, we could store the next pointer in
SlabChunk->slab/block instead (making it a union) - while on the freelist we
don't need to dereference those, right?
I suspect both would also make the block initialization a bit cheaper.
That should also accelerate SlabBlockGetChunk(), which currently shows up as
an imul, which isn't exactly fast either (and uses a lot of execution ports).
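To sketch the second variant (struct layout approximated from slab.c; the
freeHead field is invented for illustration): while a chunk sits on the
freelist its block pointer is dead weight, so the same space can hold the
next free chunk, and neither alloc nor free needs an index or a division:

typedef struct SlabChunk
{
    union
    {
        SlabBlock  *block;              /* owning block, while allocated */
        struct SlabChunk *nextFree;     /* next free chunk, while free */
    }           u;
    SlabContext *slab;                  /* owning context */
    /* payload follows */
} SlabChunk;

/* freeing: push onto the block's intrusive freelist */
SlabBlock  *block = chunk->u.block;
chunk->u.nextFree = block->freeHead;
block->freeHead = chunk;

/* allocating: pop the head and restore the block pointer */
chunk = block->freeHead;
block->freeHead = chunk->u.nextFree;
chunk->u.block = block;

Since the union reuses space the chunk header needs anyway, this wouldn't
increase the minimum allocation size.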
3) Every free/alloc needs to unlink a block from slab->freelist[i] and then
relink it, and that shows up prominently in profiles. It roughly triples the
number of cachelines touched in the happy path, with unpredictable accesses to boot.
Perhaps we could reduce the precision of slab->freelist indexing to amortize
that cost? I.e. not have one slab->freelist entry for each nfree, but instead
have an upper limit on the number of freelists?
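Something along these lines, for example (names invented, purely
illustrative): clamp nfree into a small number of buckets, so a block only
gets relinked when it crosses a bucket boundary instead of on every
alloc/free:

#define SLAB_FREELIST_BUCKETS 8     /* instead of chunksPerBlock + 1 lists */

static inline int
SlabFreelistBucket(SlabContext *slab, int nfree)
{
    /* bucket 0 is reserved for completely full blocks */
    if (nfree == 0)
        return 0;
    /* spread the remaining counts evenly across the other buckets */
    return 1 + ((nfree - 1) * (SLAB_FREELIST_BUCKETS - 1)) / slab->chunksPerBlock;
}

A block would then move between freelists at most a handful of times while
filling up or draining, rather than on every single operation.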
4) Less of a performance issue, and more of a usability one: The constant
block size strikes me as problematic. Most uses of an allocator will
sometimes involve a small amount of data, and sometimes a large amount.
Greetings,
Andres Freund