From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
Date: 2015-09-02 19:24:33
Message-ID: CAM3SWZSH3dyC6+mYO5Ag39GL=wqorfZFcAaGa3RkixUSweuRSw@mail.gmail.com
Lists: pgsql-hackers
On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> Hm. Does a compiler test actually test anything reliably here? Won't this
> just throw a warning during compile time about an unknown function?
I'll need to look into that.
>> +/*
>> + * Prefetch support -- Support memory prefetching hints on some platforms.
>> + *
>> + * pg_rfetch() is specialized for the case where an array is accessed
>> + * sequentially, and we can prefetch a pointer within the next element (or an
>> + * even later element) in order to hide memory latency. This case involves
>> + * prefetching addresses with low temporal locality. Note that it's rather
>> + * difficult to get any kind of speedup using pg_rfetch(); any use of the
>> + * intrinsic should be carefully tested. Also note that it's okay to pass it
>> + * an invalid or NULL address, although it's best avoided.
>> + */
>> +#if defined(USE_MEM_PREFETCH)
>> +#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
>> +#endif
>
> What is that name actually supposed to mean? 'rfetch' doesn't seem to be
> particularly clear - or is it meant to be wordplay combined with the
> p?
"Read fetch". One argument past to the intrinsic here specifies that
the variable will be read only. I did things this way because I
imagined that there would be very limited uses for the macro only. I
probably cover almost all interesting cases for explicit memory
prefetching already.
> I don't think you should specify the read/write and locality parameters
> if we don't hand-pick them - right now you're saying the memory will
> only be read and that it has no temporal locality.
>
> I think there should be an empty fallback definition even if the current
> only caller ends up not needing it - not all callers will require it.
It started out that way, but Tom felt that it was better to have a
USE_MEM_PREFETCH guard because of the branch below...
>> + /*
>> + * Perform memory prefetch of tuple that's three places
>> + * ahead of current (which is returned to caller).
>> + * Testing shows that this significantly boosts the
>> + * performance for TSS_INMEM "forward" callers by
>> + * hiding memory latency behind their processing of
>> + * returned tuples.
>> + */
>> +#ifdef USE_MEM_PREFETCH
>> + if (readptr->current + 3 < state->memtupcount)
>> + pg_rfetch(state->memtuples[readptr->current + 3]);
>> +#endif
>
> Hm. Why do we need a branch here? The targets of prefetches don't need to
> be valid addresses, and adding a bunch of arithmetic and a branch for the
> corner case doesn't sound worthwhile to me.
...which is needed, because I'm still required to not dereference wild
pointers. In other words, although pg_rfetch()/__builtin_prefetch()
does not require that a valid pointer be passed, it is not okay to
read past an array's bounds to read that pointer. The GCC docs are
clear on this -- "Data prefetch does not generate faults if addr is
invalid, but the address expression itself must be valid". Also, for
cases that don't benefit (like datum sorts, which never have a "tuple
proper" -- the above code is for tuplestore, not tuplesort, so you
can't see this), it seemed faster to have the branch than to rely on
__builtin_prefetch() being forgiving about invalid/wild pointers. I
saw a regression without the branch for that case.
> What worries me about adding explicit prefetching is that it's *highly*
> platform and even micro-architecture dependent. Why is looking three
> elements ahead the right value?
Because that was the fastest value following testing on my laptop. You
are absolutely right to point out that this isn't a good reason to go
with the patch -- I share your concern. All I can say in defense of
that is that other major system software does the same, without any
regard for the underlying microarchitecture AFAICT. So, yes,
certainly, more testing is required across a reasonable cross-section
of platforms to justify the patch. But right now, time spent in
_bt_buildadd() is massively reduced with the patch, which makes it
worthwhile in my mind. It's really very effective if you look at the
only part of (say) a CREATE INDEX that is affected one way or another
-- the final fetching of tuples.
BTW, I am very close to posting a patch that makes (say) CREATE INDEX
very fast for external sorts. That uses a memory pool to make final
on-the-fly merge memory access sequential per tape. That's what I'll
do rather than explicitly prefetching for external sorts. It makes a
huge difference on top of the external sort work already posted,
because the final merge step was a big remaining bottleneck which I've
now fixed. An unlogged CREATE INDEX with 6 runs to merge is often only
15% slower than an equivalent CREATE INDEX that is all internal
(against the master branch).
--
Peter Geoghegan