Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From: Amir Rohan <amir(dot)rohan(at)zoho(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
Date: 2015-10-13 12:30:29
Message-ID: 561CF965.2040100@zoho.com
Lists: pgsql-hackers

On 10/11/2015 03:20 AM, Peter Geoghegan wrote:
> On Thu, Sep 3, 2015 at 5:35 PM, David Rowley
> <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>> My test cases are:
>
> Note that my text caching and unsigned integer comparison patches have
> moved the baseline down quite noticeably. I think that my mobile
> processor out-performs the Xeon you used for this, which seems a
> little odd even taken the change in baseline performance into account.
>

To add a caveat not yet mentioned: the idea behind prefetching is to
sacrifice spare memory bandwidth for performance. That can be
a winning bet on a quiet box (the kind we benchmark on), but it can
backfire on a production db, where the extra memory pressure can degrade
all running queries. Something to test for, or at least keep in mind.
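
For readers who have not looked at the patch, here is a minimal sketch of
the general technique under discussion: hinting a tuple a few slots ahead
into cache while walking an array of pointers in order. This is not the
code from the patch. DemoSortTuple, sum_with_prefetch and
PREFETCH_DISTANCE are hypothetical names made up for illustration, and the
distance of 3 is an untuned guess. The only real API used is GCC's
__builtin_prefetch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sort tuple: a pointer to out-of-line data. */
typedef struct
{
    const long *tuple;      /* points at the "real" tuple elsewhere in memory */
} DemoSortTuple;

/* How far ahead to prefetch; an illustrative guess, not a tuned value. */
#define PREFETCH_DISTANCE 3

/*
 * Sum the values behind each array slot in order, hinting the tuple a few
 * slots ahead into cache before it is needed. Each hint costs memory
 * bandwidth whether or not it ends up helping.
 */
long
sum_with_prefetch(const DemoSortTuple *tuples, size_t n)
{
    long        total = 0;

    assert(tuples != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
#if defined(__GNUC__)
        /* rw = 0 (read), locality = 0 (non-temporal hint) */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(tuples[i + PREFETCH_DISTANCE].tuple, 0, 0);
#endif
        total += *tuples[i].tuple;
    }
    return total;
}
```

The hint is advisory, so the result is the same with or without it; only
the timing and the memory traffic change, which is exactly why it needs
testing under load and not just on an idle machine.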

>> set work_mem ='1GB';
>> create table t1 as select md5(random()::text) from
>> generate_series(1,10000000);
>>
>> Times are in milliseconds. Median and average over 10 runs.
>>
>> Test 1
>

I am the reluctant owner of outmoded hardware, namely a Core 2 from
around 2007 on plain spinning metal. My results (Linux, 64-bit):

------
Test 1
------
set work_mem ='1GB';
select count(distinct md5) from t1;

== Master ==
42771.040 ms <- outlier?
41704.570 ms
41631.660 ms
41421.877 ms

== Patch ==
42469.911 ms <- outlier?
41378.556 ms
41375.870 ms
41118.105 ms
41096.283 ms
41095.705 ms

------
Test 2
------
select sum(rn) from (select row_number() over (order by md5) rn from t1) a;

== Master ==
44186.775 ms
44137.154 ms
44111.271 ms
44061.896 ms
44109.122 ms

== Patch ==
44592.590 ms
44403.076 ms
44276.170 ms

A very slight difference in an ambiguous direction, but also no perf
catastrophe.
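
To put a number on "very slight", the runs above can be reduced to a
median per build and a relative difference. The following is only a
sketch of that arithmetic; median_ms and percent_change are hypothetical
helper names, not anything from PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int
cmp_double(const void *a, const void *b)
{
    double      x = *(const double *) a;
    double      y = *(const double *) b;

    return (x > y) - (x < y);
}

/* Median of n timings; sorts the caller's array in place. */
double
median_ms(double *runs, size_t n)
{
    assert(n > 0);
    qsort(runs, n, sizeof(double), cmp_double);
    if (n % 2 == 1)
        return runs[n / 2];
    return (runs[n / 2 - 1] + runs[n / 2]) / 2.0;
}

/* Percentage by which 'patched' is slower (positive) or faster (negative). */
double
percent_change(double base, double patched)
{
    return (patched - base) / base * 100.0;
}
```

Fed the Test 2 timings above, this gives medians of 44111.271 ms for
master and 44403.076 ms for the patch, i.e. the patch is under 1% slower.
With only three to five runs per build, that is well within what I would
call noise.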

> It's worth considering that for some (possibly legitimate) reason, the
> built-in function call is ignored by your compiler, since GCC has
> license to do that. You might try this on both master and patched
> builds:
>
> ~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o'
> -ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt
>
> ...
>
> Notably, there is a prefetchnta instruction here.
>

I have verified that the prefetch is emitted in the disassembly.

An added benefit of owning outmoded hardware is that the MSR for this
generation is public and I can disable individual prefetcher units by
twiddling a bit.

Disabling the "HW prefetch" or the "DCU prefetch" units on a patched
version gave results that look relatively unchanged, which seems promising.
Disabling them both at once on an unpatched version shows a slowdown of
5-6% in test1 (43347.181, 43898.705, 43399.428). That gives an
indication of maximum potential gains in this direction, for this box
at least.
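
For anyone who wants to repeat the experiment, the bit twiddling amounts
to the sketch below. The register and bit positions are assumptions on my
part and are not spelled out above: as I understand Intel's documentation
for the Core 2 generation, the disable bits live in IA32_MISC_ENABLE
(MSR 0x1A0), bit 9 for the hardware prefetcher and bit 37 for the DCU
prefetcher. Check the manual for your own CPU before writing anything.
set_prefetch_disable is a hypothetical helper that only computes the new
register value; actually reading and writing the MSR needs root and
something like the Linux msr driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ASSUMPTION: Core 2 era layout of IA32_MISC_ENABLE. These values are not
 * taken from this thread; verify them against the Intel SDM for your CPU.
 */
#define MSR_IA32_MISC_ENABLE        0x1A0
#define HW_PREFETCH_DISABLE_BIT     9
#define DCU_PREFETCH_DISABLE_BIT    37

/*
 * Return 'msr' with the given disable bit set (disable != 0) or cleared
 * (disable == 0). Pure arithmetic; it does not touch the hardware.
 */
uint64_t
set_prefetch_disable(uint64_t msr, int bit, int disable)
{
    assert(bit >= 0 && bit < 64);
    if (disable)
        return msr | (UINT64_C(1) << bit);
    return msr & ~(UINT64_C(1) << bit);
}
```

Setting a bit turns the corresponding prefetcher off, so the "both at
once" run corresponds to setting both bits and writing the value back on
every core.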

Finally, I notice my results are 4x slower than everyone else's.
That can be very tough on a man's pride, let me tell you.

Amir
