From: | Marc Colosimo <mcolosimo(at)mitre(dot)org> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | List pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Marc Colosimo <mcolosimo(at)mitre(dot)org>, Manfred Spraul <manfred(at)colorfullife(dot)com>, Karel Zak <zakkr(at)zf(dot)jcu(dot)cz> |
Subject: | Re: tweaking MemSet() performance - 7.4.5 |
Date: | 2004-09-29 13:38:39 |
Message-ID: | DF0A0E72-121C-11D9-830D-000A95A5D8B2@mitre.org |
Lists: | pgsql-hackers |
On Sep 29, 2004, at 7:37 AM, Bruce Momjian wrote:
> Karel Zak wrote:
>> On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:
>>> mcolosimo(at)mitre(dot)org wrote:
>>>
>>>>> If the memset
>>>>> bypasses the cache then the following access will cause a cache
>>>>> line
>>>>> miss, which can be so slow that using the faster memset can result
>>>>> in a
>>>>> net performance loss.
>>>>
>>>> Could you suggest some structs to test? If I get your meaning, I
>>>> would make a loop that sets then reads from the structure.
>>>>
>>> Read the sources and the cpu specs. Benchmarking such problems is
>>> virtually impossible.
>>> I don't have OS-X, thus I checked the Linux-kernel sources: It seems
>>> that the power architecture doesn't have the same problem as x86.
>>> There is a special clear cacheline instruction for large memsets and
>>> the
>>> rest is done through carefully optimized store
>>> byte/halfword/word/double
>>> word sequences.
>>>
>>> Thus I'd check what happens if you memset not perfectly aligned
>>> buffers.
>>> That's another point where over-optimized functions sometimes break
>>> down. If there is no slowdown, then I'd replace the postgres function
>>> with the OS provided function.
>>>
All memory allocated via malloc and friends is aligned on OS X, unless
you remove the padding (which I don't think you do).
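For anyone who wants to try the unaligned case anyway, here is a minimal sketch of how a test could check alignment and deliberately misalign a buffer. The helper names is_aligned and misalign are made up for illustration; they are not from the PostgreSQL tree or from my test programs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of align; align must be a power of two. */
static int
is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t) p & (align - 1)) == 0;
}

/*
 * Hypothetical helper: step `offset` bytes into an aligned buffer so the
 * result is no longer word aligned (for 0 < offset < sizeof(long)).
 */
static char *
misalign(char *base, size_t offset)
{
    return base + offset;
}
```

Passing misalign(buf, 1) instead of buf to both set routines would show whether either one breaks down on unaligned input.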
>>> I'd add some __builtin_constant_p() optimizations, but I guess Tom
>>> won't
>>> like gcc hacks ;-)
>>
>> I think it cannot be problem if you write it to some .h file (in port
>> directory?) as macro with "#ifdef GCC". The other thing is real
>> advantage of hacks like this in practical PG usage :-)
>
> The reason MemSet is a win is not that the C code is great but because
> it eliminates a function call.
>
Using MemSet really did speed things up. I think the function overhead
is okay. As for real-world usage, the function ExecMakeFunctionResult
dropped from the top of the list when profiling (now < 1% vs 16%
before)! That was while running a big, nasty delete (with cascading)
and insert in a cursor.
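To make the trade-off concrete, here is a rough sketch of the kind of macro being discussed: an inline word-at-a-time loop for small, aligned, zero-filled regions, with a fallback to the library memset for everything else. This is not the PostgreSQL source. The names SKETCH_MEMSET, SKETCH_LOOP_LIMIT, and SKETCH_ALIGN_MASK, and the cutoff of 64 bytes, are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff: larger regions go to the library memset. */
#define SKETCH_LOOP_LIMIT 64
/* Low bits that must be clear for a word-aligned address or length. */
#define SKETCH_ALIGN_MASK (sizeof(long) - 1)

/*
 * Inline path: only when the region is word aligned, a whole number of
 * words long, being set to zero, and no larger than the cutoff.
 * Everything else falls through to memset().
 */
#define SKETCH_MEMSET(start, val, len) \
    do { \
        void   *_p = (start); \
        int     _v = (val); \
        size_t  _n = (len); \
        if (((uintptr_t) _p & SKETCH_ALIGN_MASK) == 0 && \
            (_n & SKETCH_ALIGN_MASK) == 0 && \
            _v == 0 && _n <= SKETCH_LOOP_LIMIT) \
        { \
            long *_w = (long *) _p; \
            long *_stop = (long *) ((char *) _p + _n); \
            while (_w < _stop) \
                *_w++ = 0; \
        } \
        else \
            memset(_p, _v, _n); \
    } while (0)
```

The inline path saves the call, but it is plain C stores, so on a platform whose memset is hand-tuned the fallback may well be the faster branch.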
Here are results for a single-processor Mac G4 running OS X 10.3,
compiled with -O2. This time the Mac memset wins across the board,
although someone posted earlier that this wasn't the case.
PG MemSet:
pgmemset_test 32
0.670u 0.020s 0:00.70 98.5% 0+0k 0+0io 0pf+0w
pgmemset_test 64
1.060u 0.000s 0:01.05 100.9% 0+0k 0+0io 0pf+0w
pgmemset_test 128
1.750u 0.010s 0:01.76 100.0% 0+0k 0+0io 0pf+0w
pgmemset_test 512
6.010u 0.030s 0:06.04 100.0% 0+0k 0+0io 0pf+0w
Mac memset:
memset_test 32
0.660u 0.020s 0:00.67 101.4% 0+0k 0+0io 0pf+0w
memset_test 64
0.720u 0.000s 0:00.72 100.0% 0+0k 0+0io 0pf+0w
memset_test 128
0.800u 0.010s 0:00.81 100.0% 0+0k 0+0io 0pf+0w
memset_test 512
1.470u 0.010s 0:01.48 100.0% 0+0k 0+0io 0pf+0w
I also checked setting a byte right after the memset, and it does slow
things down a tiny bit. But the slowdown is the same for both MemSet and
memset for sizes under 64.
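The set-then-touch check can be sketched roughly as follows. This is not the actual test program; the function name clear_and_touch and its parameters are assumptions for illustration, and it only shows the memset side.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Clear buf `iterations` times. When touch is nonzero, store one byte
 * right after each clear, so that a memset which bypassed the cache
 * would pay for a cache-line miss here. The byte is read back through a
 * volatile pointer and summed so the loop has an observable result.
 */
static unsigned long
clear_and_touch(char *buf, size_t len, unsigned long iterations, int touch)
{
    volatile char *vbuf = buf;
    unsigned long sum = 0;
    unsigned long i;

    assert(buf != NULL && len > 0);
    for (i = 0; i < iterations; i++)
    {
        memset(buf, 0, len);
        if (touch)
            vbuf[len - 1] = 1;
        sum += (unsigned long) vbuf[len - 1];
    }
    return sum;
}
```

Running it under time(1) with and without the touch, for each of the sizes above, gives the comparison; swapping the memset call for the macro gives the MemSet side.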