From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: things I learned from working on memory allocation |
Date: | 2014-07-13 18:23:18 |
Message-ID: | 20140713182318.GH1136@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Robert,
On 2014-07-07 15:57:00 -0400, Robert Haas wrote:
> 1. I tried to write a single allocator which would be better than
> aset.c in two ways: first, by allowing allocation from dynamic shared
> memory; and second, by being more memory-efficient than aset.c.
> Heikki told me at PGCon that he thought that was a bad idea, and I now
> think he was right.
I agree with that as well. I am wondering whether it's possible to use
most of the same code and compile it once as a per process allocator and
once as the allocator for shared memory.
> Allocating from dynamic shared memory requires
> the use of relative rather than absolute pointers (because the DSM
> could be mapped at different addresses in different processes) and, if
> concurrent allocating and freeing is to be supported, locking.
> Although neither of these things requires very many extra
> instructions, the common path for aset.c is extremely short, and we
> haven't got those instructions to spare. Code like "if (lock != NULL)
> LWLockAcquire(lock, LW_EXCLUSIVE)" is quite expensive even if the
> branch is never taken, apparently because being prepared to call a
> function at that point requires storing more stuff in our stack frame,
> and extra instructions to save and restore registers are a painfully
> expensive on allocation-heavy workloads. On PPC64, to match aset.c's
> performance, a palloc/pfree cycle must run in ~120 instructions, ~80
> cycles, which doesn't leave much room for fluff.
The actual if (lock != NULL) bit costs significant amounts of cycles?
I'd have assumed that branch prediction takes care of that. Or is it
actually the icache not keeping up? Did you measure icache vs. dcache
misses?
Have you played with unlikely()/likely() type of macros?
I don't think that any such fixes changes will make enough of a
difference to negate the point that these simply are two different
allocators with different requirements, but it's still interesting.
> 2. After ripping the relative-pointer and locking stuff out of my
> allocator, I came up with something which performs about like aset.c
> on allocation, but with significantly better memory efficiency on
> small allocations. REINDEX pgbench_accounts_pkey saves about 22%,
> because the SortTuple array saves nothing but the individual tuples
> take only half as much space, by avoiding the StandardChunkHeader
> overhead. This seems like a savings worth pursuing, if we can figure
> out how to get it.
Yes, that'd certainly be nice.
> Unfortunately, there's some incremental overhead
> in pfree, amounting to a single test of global variable, even when the
> new allocator is never used, which is measurable on a microbenchmark;
> I don't remember the exact numbers off-hand.
Hm. Why's that needed? Because you're searching for your allocator's
superblocks in pfree()? I guess you don't store information about the
size/context for every allocatation like aset.c does?
> When the new allocator
> *is* used, pfree gets a lot slower - mostly because one of the data
> structures used by the new allocator suffer occasional cache misses
> during free operations. I think there might be an argument that some
> hit on pfree would be worth taking in exchange for better
> space-efficiency, because we mostly free contexts by resetting them
> anyway; but I haven't yet managed to make that hit small enough to
> feel especially good about it.
The speed bump is hit when pfree()ing a allocation that's been allocated
with the new allocator or also when there's any concurrent activity for
that allocator?
> 3. The current MemoryContext abstraction layer is inadequate for
> almost everything one might want to do. The idea of a context that
> allows only allocation, and ignores requests to free memory, has been
> discussed more than once. Such an allocator could skip the
> power-of-two rounding that aset.c does, but it couldn't omit the chunk
> header, which means that in practice it would save no memory at all
> for cases like the one mentioned above (REINDEX
> pgbench_accounts_pkey).
I personally think that the performance benefit of not doing the
rounding, not accessing the freelist, and such is more interesting for
such an allocator than the memory savings. I'm pretty sure such a
'allocation only' allocator for e.g. the parser and parts of the
executor would be a rather noticeable performance benefit.
But even for the cases where the space savings are the important bit: To
me it sounds feasible to declare that some allocators don't allow
reallocations and freeing. Those then would be allowed to omit the chunk
header. To make that debuggable we would need to add a chunk header
when assertions are enabled to error out when such an operation is
performed - but that's a acceptable price imo.
Btw, am I the only one that wondered if it wouldn't be rather
beneficial to make aset.c add the chunk header before rounding?
> While there might be some benefit in other
> cases, without getting rid of or at least reducing the size of the
> chunk-header, I expect that the benefits will be too narrow to make
> this approach worth pursuing. The chunk-header requirement is also a
> killer for DSM, again because segments can be mapped at different
> addresses in different backends. I'm inclined to think that if we
> can't find a way to get rid of the chunk-header requirement, we ought
> to consider ripping out the whole abstraction layer as a
> performance-wasting exercise in futility.
FWIW, I have done that in an experimental patch and at least on x86-64
the performance benefits were neglegible. Didn't evaluate memory
efficiency though.
> 4. The MemoryContext layer embeds assumptions not only about the
> layout of memory, but the performance of various operations. For
> example, GetMemoryContextSpace() is assumed to be cheap, so there's no
> interface like void *MemoryContextAlloc(Size request_size, Size
> *chunk_size) or Size MemoryContextFree(Size chunk_size), but those
> would be significantly more efficient for my new allocator.
I think mcxct.c is at least partially right in that. An memory
allocation interface that requires keeping track of the current
allocation size in userland *sucks*. Sure there are some cases where
that's not too much of a bother, but in the majority of cases it'll made
code harder to read and quite possibly just move the overhead in lots
more places.
I don't really see a problem with tweaking the interface so it allows
MemoryContextAlloc to immediately return the actually allocated size -
if we're doing that in a way that doesn't break existing users - why not?
> Possibly
> better still would be to have the context itself track utilization and
> soft-fail when we run out of space; even for aset.c, it makes little
> sense to refuse to accept another tuple when the AllocSet has a
> suitable object on the freelist. That's probably not a critical flaw
> because, since tuples being sorted are likely all about the same size,
> the waste may be little or nothing. Other allocators might have
> different space-management strategies, though, where this matters
> more.
I can't really figure out what you're talking about here.
> 5. It's tempting to look at other ways of solving the parallel sort
> problem that don't need an allocator - perhaps by simply packing all
> the tuples into a DSM one after the next. But this is not easy to do,
> or at least it's not easy to do efficiently. Tuples enter tuplesort.c
> through one of the tuplesort_put*() functions, most of which end up
> calling one of the copytup_*() functions. But you can't easily just
> change those functions to stick the tuple someplace else, because they
> don't necessarily get the address of the space to be used from palloc
> directly. In particular, copytup_heap() calls
> ExecCopySlotMinimalTuple(), and copytup_cluster() calls
> heap_copytuple(). heap_copytuple() is simple enough that you might be
> able to finagle a new API that would make it work, but I'm not sure
> what we could really do about ExecCopySlotMinimalTuple() except call
> it and then copy the result. Perhaps that'd be OK for a first
> version.
This sounds like a problem that can be solved in a relatively localized
fashion. I think it'd be fair enough to provide one function that does
it efficiently and just fall back to ocpying for everything else.
> 6. In general, I'm worried that it's going to be hard to keep the
> overhead of parallel sort from leaking into the non-parallel case.
> With the no-allocator approach, every place that uses
> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
> DSM and non-DSM cases differently, which isn't great for either
> performance or maintainability. And even with an allocator, the
> SortTuple array will need to use relative pointers in a DSM; that
> might burden the non-DSM case.
Yes, I share that concern.
I somewhat doubt that tuplesort.c really is the right layer to do the
parallel part of parallel sort including the chunked storage. Why not
pack the tuples outside it and teach tuplesort.c to not copy tuples for
some _put* routine? Then tuplesort can entirely happen inside a single
process and doesn't have to worry about indirect pointers and anything?
Yes, it'll need slightly more memory, but that doesn't sound too bad.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2014-07-13 18:59:49 | Re: tweaking NTUP_PER_BUCKET |
Previous Message | Tom Lane | 2014-07-13 15:00:34 | Re: Allowing NOT IN to use ANTI joins |