From: | Ankit Kumar Pandey <itsankitkp(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | pghackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com> |
Subject: | Re: Sort optimizations: Making in-memory sort cache-aware |
Date: | 2023-02-12 08:13:27 |
Message-ID: | 04a0dc6c-b789-67fe-f3c6-b2a8e8a36509@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
> On 12/02/23 01:59, Andres Freund wrote:
> However, tuplesort.c completely breaks that for > 1 column
> sorts. While spatial locality for accesses to the ->memtuples array is decent
> during sorting, due to qsort's subdividing of the problem, the locality for
> access to the tuples is *awful*.
> The memtuples array is reordered while sorting, but the tuples themselves
> aren't. Unless the input data is vaguely presorted, the access pattern for the
> tuples has practically zero locality.
This probably explains the issue.
> One idea is to keep track of the distinctness of the first column sorted and
> to behave differently if it's significantly lower than the number of to be
> sorted tuples. E.g. by doing a first sort solely on the first column, then
> reorder the MinimalTuples in memory, and then continue normally.
> There's two main problems with that idea:
> 1) It's hard to re-order the tuples in memory, without needing substantial
> amounts of additional memory
> 2) If the second column also is not very distinct, it doesn't buy you much, if
> anything.
> But it might provide sufficient benefits regardless. And a naive version,
> requiring additional memory, should be quick to hack up.
I get the second point (to reorder MinimalTuples by sorting on first column) but
not sure how can keep `track of the distinctness of the first column`?
> Most sorting papers don't discuss
> variable-width data, nor a substantial amount of cache-polluting work while
> gathering the data that needs to be sorted.
I definitely agree with this.
> My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
> just those using the existing code, allocate new memory for all the
> corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
> that, obviously in ->memtuples order.
This should be easy hack and we can easily profile benefits from this.
> It's very likely we can do better than just doing a plain sort of everything
> after that.
> You effectively end up with a bounded number of pre-sorted blocks, so the most
> obvious thing to try is to build a heap of those blocks and effectively do a
> heapsort between the presorted blocks.
This is very interesting. It is actually what few papers had suggested, to
do sort in blocks and then merge (while sorting) presorted blocks.
I am bit fuzzy on implementation of this (if it is same as what I am thinking)
but this is definitely what I was looking for.
> A related, but separate, improvement is to reduce / remove the memory
> allocation overhead. The switch to GenerationContext helped some, but still
> leaves a bunch of overhead. And it's not used for bounded sorts right now.
> We don't palloc/pfree individual tuples during a normal sorts, but we do have
> some, for bounded sorts. I think with a reasonable amount of work we could
> avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
> allocator, getting rid of all allocator overhead.
> The biggest source of individual pfree()s in the bounded case is that we
> unconditionally copy the tuple into base->tuplecontext during puttuple. Even
> though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.
> We could instead pre-check that the tuple won't immediately be discarded,
> before copying it into tuplecontext. Only in the TSS_BOUNDED, case, of
> course.
This Looks doable, try this.
> I think we also can replace the individual freeing of tuples in
> tuplesort_heap_replace_top(), by allowing a number of dead tuples to
> accumulate (up to work_mem/2 maybe), and then copying the still living tuples
> into new memory context, freeing the old one.
> While that doesn't sound cheap, in bounded sorts the number of tuples commonly
> is quite limited, the pre-check before copying the tuple will prevent this
> from occurring too often, the copy will result in higher locality, and, most
> importantly, the reduced palloc() overhead (~25% or so with aset.c) will
> result in a considerably higher cache hit ratio / lower memory usage.
I would try this as well.
> There are things we could do to improve upon this that don't require swapping
> out our sorting implementation wholesale.
Thanks a lot Andres, these are lots of pointers to work on (without major overhaul of sort
implementation and with potentially good amount of improvements). I will give these a try
and see if I could get some performance gains.
Thanks,
Ankit
From | Date | Subject | |
---|---|---|---|
Next Message | David Rowley | 2023-02-12 10:43:38 | Re: Making aggregate deserialization (and WAL receive) functions slightly faster |
Previous Message | Tom Lane | 2023-02-12 06:39:13 | Re: Making aggregate deserialization (and WAL receive) functions slightly faster |