Re: branch-free tuplesort partitioning

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: branch-free tuplesort partitioning
Date: 2025-02-10 08:18:34
Message-ID: CANWCAZaQqa5MCbu+L4NWGtnYqNPM6Keqj_2GJz0cp0nFtcm0mg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I benchmarked this a couple weeks ago, and just now got around to
putting the results in a presentable form.

I took some of the more interesting cases from the B&M test Peter
shared upthread (best and worst performers in a quick test) and ran
them with a range of "m" values.

This is exploratory work, so the attached script doesn't quite
correspond to what's in the spreadsheet, but it gives some idea of the
details. I used 1 million records for everything, with prewarm and
turbo off.

-Random with modulus
This is the best case, beating our current sort on all but the lowest
cardinality inputs (which are only slightly slower).

-Sawtooth
This is the worst case, worse for all saw sizes. The case with 30 saws
is particularly bad. I tried adding a recursion stopping heap sort,
but that made things worse. I had hoisted the presorted check out of
the recursion steps because it used the full comparator (for
simplicity), and adding back a specialized check for every recursion
helped somewhat, but not enough to matter.

-Stagger [value = (i*multiplier + i) % num_records]
No clear winner here.

-Miscellaneous cases
Random and ascending are repeated in the spreadsheet for reference.
-Zipfian is from the numpy random number generator using 1.01 as the
parameter. Patch looks good here.
-"s95" is 95% ascending with 5% random tail. This regresses a bit.
-"p5" is a best case for our current sort: 95% zeros, 2.5% random
negative values, and 2.5% random positive values. The first pivot is
pretty much guaranteed to be zero, so in master all zeros are moved to
the middle partition upon first recursion. The patch takes two passes
to bucket all zeros together (because branchless comparisons with
single machine instructions must be 2-way and not 3-way), but in the
grand scheme of things it amounts to only a small regression.
-Organ regresses substantially.
-Descending regresses but not as much as I expected since descending
input is supposed to be a bad case for Lomuto partitioning schemes.

These are mixed results, so I'm not inclined to pursue this further
this cycle. There is one thing which could tip the scales in its favor
in the future: The planner doesn't yet know about the specialized
comparators using single machine instructions for the first key. The
patch has a fairly tight range for (almost) unique inputs. Whether
it's faster or slower than master, these inputs very often measure
about the same. This is not surprising because there are no branches
to mispredict during partitioning. Having predictable worst cases
would good for the planner.

One thing I didn't test is multiple keys (or resolving ties of an
abreviated first key). When the patch buckets duplicates together, it
must (if necessary) recurse to that partition to resolve the
tiebreaks. That was simple to code, but untested for either
correctness or performance. It's a bit like the multi-key sort
proposal in this regard, but with only two distinct steps:

https://www.postgresql.org/message-id/PH7P220MB1533DA211DF219996760CBB7D9EB2%40PH7P220MB1533.NAMP220.PROD.OUTLOOK.COM

--
John Naylor
Amazon Web Services

Attachment Content-Type Size
branchless-lomuto-20250210.ods application/vnd.oasis.opendocument.spreadsheet 38.1 KB
BL-perf-test-examples.sh application/x-shellscript 3.1 KB
v2-0001-Branchless-Lomuto-partitioning.patch text/x-patch 9.9 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Guo 2025-02-10 08:22:22 Re: Adjust tuples estimate for appendrels
Previous Message Ilia Evdokimov 2025-02-10 08:05:29 Re: Sample rate added to pg_stat_statements