Proposal for optimizations with simd enabled sort

From: "R, Rakshit" <rakshit(dot)r(at)intel(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, "Giacchino, Luca" <luca(dot)giacchino(at)intel(dot)com>, "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>, "Paul, Sourav Kumar" <sourav(dot)kumar(dot)paul(at)intel(dot)com>
Subject: Proposal for optimizations with simd enabled sort
Date: 2025-03-28 20:45:30
Message-ID: CO1PR11MB506011C3052AE779750D9789EDA02@CO1PR11MB5060.namprd11.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,
We would like to share a proposal for optimizations with simd enabled sort.

Sorting is a common task in Postgres, for example, for sorting tuples (e.g., for queries requiring ordered, grouped, or unique rows) or arrays of elements (e.g., list_sort). We propose integrating SIMD-based sorting to improve performance of these use cases on x86 platforms. With SIMD sort, we observe performance gains in different scenarios, such as ORDER BY queries and pgvector workloads (see results below).

SIMD-enabled sort functions are provided by the x86-simd-sort library [1]. The library includes AVX-512 and AVX2 accelerated sort functions, and it is already enabled in several ecosystems, such as numpy (10-17x speed-up [2]), OpenJDK [3], libgrape (graph processing library), and PyTorch [4].

This approach introduces x86-simd-sort as a dependency in Postgres. However, the dependency is optional, controlled by a configuration option (such as --with-simdsort). If building without the dependency, the SIMD sort functions will not be included and the existing functions will be used.

Quicksort

Postgres provides a quicksort implementation (qsort) with variants supporting different data types generated by a template (sort_template.h). Tuple sort and list sort rely on these qsort functions. Therefore, introducing SIMD implementations of qsort will benefit them.

The existing qsort uses a comparator function. On the other hand, x86-simd-sort requires an array of numeric values to sort (integer or floating point). To generate that array, we need a function to "extract" a numeric value from each of the items being sorted (SortTuple, ListCell...). The array will be in memory and requires memory allocation. A comparator can still be passed to be used as tie-breaker (for additional sorting criteria).

New qsort variants will take an extract function as an additional parameter. For example, a new variant of pg_qsort will be:

void pg_qsort_simd_float(void *base, size_t nel, size_t elsize, int (*cmp) (const void *, const void *), float(*extract) (const void *));

Additional variants will cover different return types for the extract function (double and 16/32/64-bit signed/unsigned integers). These functions will exist alongside the existing pg_qsort (which is unmodified).

We plan to generate SIMD-enabled qsort functions by extending the existing sort template with additional macros (to define the extract function to use). The template will then generate both a quicksort implementation (unmodified) and a SIMD-enabled one.

Tuple Sort

- SIMD sort is supported for integer and floating point columns.
- Extract functions are provided to extract values from each SortTuple (similar to how the type-specific comparator functions are provided). The relevant extract function is passed to the sort template to generate the corresponding SIMD sort function.
- Logic in tuplesort_sort_memtuples controls whether to call the regular qsort or the SIMD version. In general, we observe that the benefit of SIMD sorting decreases with higher tuple count and lower cardinality (see data below). Tuple count is easily available to this logic, but estimating cardinality requires more investigation (for example, using statistics at planning time or query execution time).
- For sorting criteria involving multiple columns, we currently use SIMD sorting for the first column, and we rely on the existing quicksort for other columns.

List Sort

list_sort sorts an array of ListCells using qsort with a comparator. Following a similar approach, we define additional variants that take an extract function. For example

typedef float (*list_sort_extractor_float)(const ListCell *a);
void list_sort_simd_float(List *list, list_sort_extractor_float extract, list_sort_comparator cmp);

Additional variants will cover different return types for the extract function (double and 16/32/64-bit signed/unsigned integers). These functions will exist alongside the existing list_sort (unmodified), and they will call SIMD versions of qsort. Existing list_sort use cases can be converted to list_sort_simd functions where beneficial in terms of performance.

Results

By enabling x86-simd-sort, we observe the following in our preliminary measurements. All data below uses x86-simd-sort with AVX-512. We will share AVX2 data as well.

Tuple sort
ORDER BY queries with randomly-generated data, with varying row count and cardinality (controlled by number of distinct values), single-column filter.
Collected on AWS m7i.metal-24xl, Ubuntu 24.04, gcc13
100k rows, 100k distinct values: +13.4%
100k rows, 10k distinct values: +9.8%
100k rows, 100 distinct values: +3.0%
100k rows, 10 distinct values: -1.9%
1M rows, 1M distinct values: +6.5%
1M rows, 10k distinct values: +1.6%
1M rows, 100 distinct values: -2.9%
1M rows, 10 distinct values: -5.1%
As mentioned earlier, we observe the benefit of SIMD sorting to decrease with higher row count and lower cardinality. We are exploring additional optimizations to further improve performance (for example, sorting/merging smaller runs is showing promising results).

List sort (pgvector)
pgvector uses list_sort to sort candidate vectors by distance.
Using list_sort_simd_float, we observed 7% reduction in index build time with the gist-960 dataset and 10% with the dbpedia-openai-1000k dataset (ANN-Benchmarks, HNSW index with halfvec, m=80).

list sort (microbenchmark)
We developed a microbenchmark to stress list_sort. We observe gains of 1.75x-2.63x with varying cardinality and clients (refer to [5]). Gains are lower but still present at even lower cardinality. For example, we observe 20% gain with 100k items and 10 distinct values).

In summary, we observe performance gains with x86-simd-sort in several scenarios. There are still open questions to "detect" low-cardinality data and route to the best-performing implementation.
Please let us know your thoughts. If this is a viable approach for integrating SIMD-based sorting, we will complete and submit a first rev of the patch.

[1] https://github.com/intel/x86-simd-sort
[2] https://github.com/numpy/numpy/pull/22315/
[3] https://github.com/openjdk/jdk/pull/14227
[4] https://github.com/pytorch/pytorch/pull/149362
[5] https://www.postgresql.org/message-id/CO1PR11MB506056215E799493EC5F42E8EDFE2%40CO1PR11MB5060.namprd11.prod.outlook.com

Thanks
Rakshit Rajeev

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2025-03-28 21:22:17 Re: [PATCH] SVE popcount support
Previous Message Peter Eisentraut 2025-03-28 20:34:18 Re: On non-Windows, hard depend on uselocale(3)