> Have you considered GPU-based sorting? I know there's been discussion
> in the past.
If you use OpenCL, you can fall back to a CPU driver when there is no
GPU, which lets you use all the CPU cores without having to do the
multi-threading work in the backend.
Compiling a specific kernel can be quite expensive, but because it
happens at run time for whatever device is present, it also works like
a JIT compiler in terms of system independence.