Re: 回复: An implementation of multi-key sort

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Yao Wang <yao-yw(dot)wang(at)broadcom(dot)com>, John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Wang Yao <yaowangm(at)outlook(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Hongxu Ma <hongxu(dot)ma(at)broadcom(dot)com>
Subject: Re: 回复: An implementation of multi-key sort
Date: 2024-07-08 14:40:30
Message-ID: aefdfd09-df1b-4613-9157-fdd7ca38c734@enterprisedb.com
Lists: pgsql-hackers

Hello,

Thanks for posting a new version of the patch, and for reporting a bunch
of issues in the bash scripts I used for testing. I decided to repeat
those fixed tests on both the old and new version of the patches, and I
finally have the results from three machines (the i5/xeon I usually use,
and also rpi5 for fun).

The complete scripts, raw results (CSV), and various reports (ODS and
PDF) are available in my github:

https://github.com/tvondra/mksort-tests

I'm not going to attach all of it to this message, because the raw CSV
results alone are ~3MB for each of the three machines.

You can do your own analysis on the raw CSV results, of course - see the
'csv' directory, there are data for the clean branch and the two patch
versions.

But I've also prepared PDF reports comparing how the patches work on
each of the machines - see the 'pdf' directory. There are two types of
reports, depending on what's compared to what.

The general report structure is the same - columns with results for
different combinations of parameters, followed by comparison of the
results and a heatmap (red - bad/regression, green - good/speedup).

The "patch comparison" reports compare v5/v4, so it's essentially

(timing with v5) / (timing with v4)

with the mksort enabled or disabled. And the charts are pretty green,
which means v5 is much faster than v4 - so it seems like a step in the
right direction.
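
As a rough illustration of that ratio (a sketch only, not the scripts from
the repository): the snippet below assumes each result row has been reduced
to a hypothetical (key, timing) pair, where the key is some combination of
test parameters such as data type and data distribution. The use of the
median, the 5% tolerance, and all function names are assumptions made for
this example.

```python
from collections import defaultdict
from statistics import median


def timing_ratios(baseline_rows, patched_rows):
    """Return {key: median(patched timing) / median(baseline timing)}.

    Each row is a hypothetical (key, timing_ms) pair; the key identifies
    one combination of test parameters. Only keys present in both inputs
    are compared.
    """
    def medians(rows):
        groups = defaultdict(list)
        for key, timing_ms in rows:
            groups[key].append(timing_ms)
        return {key: median(values) for key, values in groups.items()}

    base = medians(baseline_rows)
    patched = medians(patched_rows)
    return {key: patched[key] / base[key] for key in base.keys() & patched.keys()}


def classify(ratio, tolerance=0.05):
    """Map a ratio to a heatmap-style label (the tolerance is an assumption)."""
    if ratio > 1 + tolerance:
        return "regression"   # patched build is slower (red)
    if ratio < 1 - tolerance:
        return "speedup"      # patched build is faster (green)
    return "neutral"


# Made-up timings, purely to show the calculation.
v4 = [(("int", "random"), 100.0), (("int", "random"), 110.0),
      (("text", "random"), 200.0)]
v5 = [(("int", "random"), 90.0), (("int", "random"), 100.0),
      (("text", "random"), 100.0)]

ratios = timing_ratios(v4, v5)
for key in sorted(ratios):
    print(key, round(ratios[key], 3), classify(ratios[key]))
```

A ratio below 1.0 means the numerator (here v5) was faster. The same
function covers the "patch impact" comparison if master timings are passed
as the baseline.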

The "patch impact" reports compare v4/master and v5/master, i.e. this is
what the users would see after an upgrade. Attached is a small example
from the i5 machine, but the other machines behave in almost exactly the
same way (including the tiny rpi5).

For v4, the results were not great - almost everything regressed (red
color), except for the "text" data type (green).

You can immediately see v5 does much better - it still regresses, but
the regressions are way smaller. And the speedup for "text" is actually
a bit more significant (there's more/darker green).

So as I said before, I think v5 is definitely moving in the right
direction, but the regressions still seem far too significant. If you're
sorting a lot of text data, then sure - this will help a lot. But if
you're sorting int data, and it happens to be random/correlated, you're
going to pay 10-20% more. That's not great.

I haven't analyzed the code very closely, and I don't have a great idea
on how to fix this. But I think to make this patch committable, this
needs to be solved.

Considering the benefits seem to be pretty specific to "text" (and
perhaps some other data types), maybe the best solution would be to only
enable this for those cases. Yes, there are some cases where this helps
for the other data types too, but that also comes with the regressions.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: mksort-v4-vs-v5.pdf (application/pdf, 222.2 KB)
