Re: Sort optimizations: Making in-memory sort cache-aware

From: Ankit Kumar Pandey <itsankitkp(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pghackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: Sort optimizations: Making in-memory sort cache-aware
Date: 2023-02-12 08:13:27
Message-ID: 04a0dc6c-b789-67fe-f3c6-b2a8e8a36509@gmail.com
Lists: pgsql-hackers


> On 12/02/23 01:59, Andres Freund wrote:

> However, tuplesort.c completely breaks that for > 1 column
> sorts. While spatial locality for accesses to the ->memtuples array is decent
> during sorting, due to qsort's subdividing of the problem, the locality for
> access to the tuples is *awful*.

> The memtuples array is reordered while sorting, but the tuples themselves
> aren't. Unless the input data is vaguely presorted, the access pattern for the
> tuples has practically zero locality.

This probably explains the issue.
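
For anyone following along, the indirection in question looks roughly like
this (simplified from tuplesort.c's SortTuple): qsort moves these small
structs around, but each ->tuple pointer still points wherever the
MinimalTuple was originally allocated, so every multi-key comparison chases
a pointer with essentially no locality:

typedef struct SortTuple
{
    void   *tuple;      /* out-of-line tuple, still in allocation order */
    Datum   datum1;     /* inline copy of the leading sort key */
    bool    isnull1;    /* was datum1 NULL? */
    int     srctape;    /* source tape (external sorts only) */
} SortTuple;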

> One idea is to keep track of the distinctness of the first column sorted and
> to behave differently if it's significantly lower than the number of to be
> sorted tuples. E.g. by doing a first sort solely on the first column, then
> reorder the MinimalTuples in memory, and then continue normally.

> There's two main problems with that idea:
> 1) It's hard to re-order the tuples in memory, without needing substantial
> amounts of additional memory
> 2) If the second column also is not very distinct, it doesn't buy you much, if
> anything.

> But it might provide sufficient benefits regardless. And a naive version,
> requiring additional memory, should be quick to hack up.

I get the second point (reordering the MinimalTuples by sorting on the first
column), but I am not sure how we can keep `track of the distinctness of the
first column`?
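
One idea that comes to mind (not sure if this is what you meant) is to reuse
the HyperLogLog support that the abbreviated-key abort logic in tuplesort.c
already uses, feeding it the leading key of each incoming tuple. A rough,
untested sketch, with made-up names:

#include "lib/hyperloglog.h"
#include "common/hashfn.h"

typedef struct LeadColStats
{
    hyperLogLogState hll;       /* distinct-value estimator */
    int64            ntuples;   /* tuples seen so far */
} LeadColStats;

static void
leadcol_init(LeadColStats *stats)
{
    initHyperLogLog(&stats->hll, 10);   /* 2^10 registers, ~3% error */
    stats->ntuples = 0;
}

/*
 * Feed the leading key (SortTuple.datum1, i.e. the abbreviated key when one
 * is available). Hashing the low 32 bits is enough for an estimate;
 * pass-by-reference types would need to hash the real value instead.
 */
static void
leadcol_add(LeadColStats *stats, Datum leadkey)
{
    uint32 h = DatumGetUInt32(hash_uint32(DatumGetUInt32(leadkey)));

    addHyperLogLog(&stats->hll, h);
    stats->ntuples++;
}

static bool
leadcol_is_low_cardinality(LeadColStats *stats)
{
    /* the "significantly lower" threshold here is made up */
    return estimateHyperLogLog(&stats->hll) < stats->ntuples * 0.1;
}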

> Most sorting papers don't discuss
> variable-width data, nor a substantial amount of cache-polluting work while
> gathering the data that needs to be sorted.

I definitely agree with this.

> My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
> just those using the existing code, allocate new memory for all the
> corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
> that, obviously in ->memtuples order.

This should be an easy hack, and we can easily profile the benefits from it.
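
If I understand correctly, per batch it would look roughly like this inside
tuplesort.c (names made up; a real patch would also need to fix up datum1
for by-reference types, since it can point into the stored tuple):

static void
compact_sorted_batch(SortTuple *memtuples, int ntuples,
                     MemoryContext newctx)
{
    Size    total = 0;
    char   *buf;

    /* first pass: total size of all MinimalTuples in the batch */
    for (int i = 0; i < ntuples; i++)
        total += MAXALIGN(((MinimalTuple) memtuples[i].tuple)->t_len);

    /* one allocation for the whole batch */
    buf = MemoryContextAlloc(newctx, total);

    /* second pass: copy the tuples in ->memtuples (i.e. sorted) order */
    for (int i = 0; i < ntuples; i++)
    {
        MinimalTuple tup = (MinimalTuple) memtuples[i].tuple;

        memcpy(buf, tup, tup->t_len);
        memtuples[i].tuple = (void *) buf;
        buf += MAXALIGN(tup->t_len);
    }
}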

> It's very likely we can do better than just doing a plain sort of everything
> after that.
> You effectively end up with a bounded number of pre-sorted blocks, so the most
> obvious thing to try is to build a heap of those blocks and effectively do a
> heapsort between the presorted blocks.

This is very interesting. It is actually what a few papers suggested: do the
sort in blocks and then merge (while sorting) the presorted blocks.
I am a bit fuzzy on the implementation (if it is the same as what I am
thinking), but this is definitely what I was looking for.
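
To make sure I am thinking of the same thing: a k-way merge over the
presorted blocks, driven by lib/binaryheap.h? A rough sketch (RunCursor and
merge_runs are made-up names; COMPARETUP is tuplesort.c's internal
comparison macro):

#include "lib/binaryheap.h"

typedef struct RunCursor
{
    SortTuple *tuples;          /* one presorted block */
    int        pos;
    int        ntuples;
} RunCursor;

static int
run_cmp(Datum a, Datum b, void *arg)
{
    Tuplesortstate *state = (Tuplesortstate *) arg;
    RunCursor  *ra = (RunCursor *) DatumGetPointer(a);
    RunCursor  *rb = (RunCursor *) DatumGetPointer(b);

    /* binaryheap keeps the "largest" on top, so invert the comparison
     * to make the smallest tuple surface first */
    return -COMPARETUP(state, &ra->tuples[ra->pos], &rb->tuples[rb->pos]);
}

static void
merge_runs(Tuplesortstate *state, RunCursor *runs, int nruns,
           SortTuple *out)
{
    binaryheap *heap = binaryheap_allocate(nruns, run_cmp, state);
    int         outpos = 0;

    for (int i = 0; i < nruns; i++)
        if (runs[i].ntuples > 0)
            binaryheap_add_unordered(heap, PointerGetDatum(&runs[i]));
    binaryheap_build(heap);

    while (!binaryheap_empty(heap))
    {
        RunCursor *r = (RunCursor *) DatumGetPointer(binaryheap_first(heap));

        out[outpos++] = r->tuples[r->pos++];
        if (r->pos < r->ntuples)
            binaryheap_replace_first(heap, PointerGetDatum(r));
        else
            binaryheap_remove_first(heap);
    }
    binaryheap_free(heap);
}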

> A related, but separate, improvement is to reduce / remove the memory
> allocation overhead. The switch to GenerationContext helped some, but still
> leaves a bunch of overhead. And it's not used for bounded sorts right now.
> We don't palloc/pfree individual tuples during a normal sorts, but we do have
> some, for bounded sorts. I think with a reasonable amount of work we could
> avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
> allocator, getting rid of all allocator overhead.

> The biggest source of individual pfree()s in the bounded case is that we
> unconditionally copy the tuple into base->tuplecontext during puttuple. Even
> though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.

> We could instead pre-check that the tuple won't immediately be discarded,
> before copying it into tuplecontext. Only in the TSS_BOUNDED, case, of
> course.

This looks doable; I will try this.
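
Something like the below before the copy into tuplecontext, I assume (field
names are from Tuplesortstate; the helper itself is made up):

static bool
bounded_should_copy(Tuplesortstate *state, SortTuple *stup)
{
    /* only the bounded top-N heap ever discards tuples immediately */
    if (state->status != TSS_BOUNDED ||
        state->memtupcount < state->bound)
        return true;

    /*
     * memtuples[0] is the "worst" tuple in the bounded heap (it uses a
     * reversed comparator); a new tuple comparing >= it would be thrown
     * away right after being copied, so skip the copy entirely.
     */
    return COMPARETUP(state, stup, &state->memtuples[0]) < 0;
}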

> I think we also can replace the individual freeing of tuples in
> tuplesort_heap_replace_top(), by allowing a number of dead tuples to
> accumulate (up to work_mem/2 maybe), and then copying the still living tuples
> into new memory context, freeing the old one.

> While that doesn't sound cheap, in bounded sorts the number of tuples commonly
> is quite limited, the pre-check before copying the tuple will prevent this
> from occurring too often, the copy will result in higher locality, and, most
> importantly, the reduced palloc() overhead (~25% or so with aset.c) will
> result in a considerably higher cache hit ratio / lower memory usage.

I will try this as well.
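
If I read this right, something along these lines (deadBytes is a field
that would need to be added to Tuplesortstate; the threshold is your
work_mem/2 suggestion):

static void
maybe_compact_tuplecontext(Tuplesortstate *state)
{
    MemoryContext newctx;

    if (state->deadBytes < state->allowedMem / 2)
        return;                 /* not enough garbage accumulated yet */

    /* mirror how tuplesort.c creates tuplecontext today */
    newctx = GenerationContextCreate(state->sortcontext,
                                     "Caller tuples",
                                     ALLOCSET_DEFAULT_SIZES);

    /* copy every still-live tuple into the fresh context; this also
     * improves locality (by-reference datum1 values would need fixing
     * up as well) */
    for (int i = 0; i < state->memtupcount; i++)
    {
        MinimalTuple old = (MinimalTuple) state->memtuples[i].tuple;
        MinimalTuple fresh = MemoryContextAlloc(newctx, old->t_len);

        memcpy(fresh, old, old->t_len);
        state->memtuples[i].tuple = (void *) fresh;
    }

    MemoryContextDelete(state->tuplecontext);
    state->tuplecontext = newctx;
    state->deadBytes = 0;
}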

> There are things we could do to improve upon this that don't require swapping
> out our sorting implementation wholesale.

Thanks a lot, Andres. These are lots of pointers to work on (without a major
overhaul of the sort implementation, and with a potentially good amount of
improvement). I will give these a try and see if I can get some performance
gains.

Thanks,
Ankit
