From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-03-07 15:54:32
Message-ID: 5319EBB8.2070500@vmware.com
Lists: pgsql-hackers
On 02/18/2014 09:23 PM, Heikki Linnakangas wrote:
> On 02/17/2014 10:36 PM, Andres Freund wrote:
>> On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote:
>>> This is what I came up with. I like it, I didn't have to contort lwlocks as
>>> much as I feared. I added one field to LWLock structure, which is used to
>>> store the position of how far a WAL inserter has progressed. The LWLock code
>>> calls it just "value", without caring what's stored in it, and it's used by
>>> new functions LWLockWait and LWLockWakeup to implement the behavior the WAL
>>> insertion slots have, to wake up other processes waiting for the slot
>>> without releasing it.
>>>
>>> This passes regression tests, but I'll have to re-run the performance tests
>>> with this. One worry is that if the padded size of the LWLock struct is
>>> smaller than cache line, neighboring WAL insertion locks will compete for
>>> the cache line. Another worry is that since I added a field to LWLock
>>> struct, it might now take 64 bytes on platforms where it used to be 32 bytes
>>> before. That wastes some memory.
>>
>> Why don't you allocate them in a separate tranche, from xlog.c? Then you
>> can store them inside whatever bigger object you want, guaranteeing
>> exactly the alignment you need. Possibly you can even have the extra
>> value in the enclosing object?
>
> Good idea. New patch attached, doing that.
>
> I'll try to find time on some multi-CPU hardware to performance test
> this against current master, to make sure there's no regression.
Ok, I ran the same tests I used earlier for the xloginsert scaling
patch, with REL9_3_STABLE, current master, and the patch to refactor the
xloginsert slots into lwlocks.
The main question I was trying to answer was: Is the new patch similar
in performance to current master? The short answer is "yes". In some
tests there were significant differences, but overall I couldn't say
which one was better.
The test case I used is pgbench with a custom script using a backend
extension called "xlogtest", which just does a bunch of XLogInserts of
dummy WAL records of a given size. Small and large WAL records have
quite different characteristics in how they stress the xlog machinery. I
used three different WAL record sizes: 20, 100 and 8000 bytes, excluding
the WAL record header. 20 bytes is pretty much the minimum size of a
realistic WAL record, for something like a heap deletion. 100 bytes
would be typical of an insert or update record, while 8000 bytes would
be a full-page write or b-tree page split. The number of such inserts
done per transaction was scaled so that each transaction inserts about
100000 bytes in total. That's quite a lot, but with shorter transactions
you easily get bottlenecked by other things like the ProcArrayLock, and
the point of this test was to exercise WAL insertions.
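For reference, the core of the xlogtest extension is essentially just a loop like the sketch below. The function name and signature (xlogtest_insert(size, count)) are assumptions here, and the real extension may differ, but it shows the shape of the test: hammer XLogInsert with dummy records and nothing else.

#include "postgres.h"
#include "fmgr.h"
#include "access/xlog.h"
#include "storage/buf.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(xlogtest_insert);

/*
 * Insert 'count' dummy WAL records of 'size' payload bytes each,
 * using the pre-9.5 XLogInsert() interface, which takes a chain of
 * XLogRecData structs.
 */
Datum
xlogtest_insert(PG_FUNCTION_ARGS)
{
    int32       size = PG_GETARG_INT32(0);
    int32       count = PG_GETARG_INT32(1);
    char       *payload = palloc0(size);
    XLogRecData rdata;
    int         i;

    rdata.data = payload;
    rdata.len = size;
    rdata.buffer = InvalidBuffer;   /* no data page involved */
    rdata.buffer_std = false;
    rdata.next = NULL;

    for (i = 0; i < count; i++)
        XLogInsert(RM_XLOG_ID, XLOG_NOOP, &rdata);

    pfree(payload);
    PG_RETURN_VOID();
}

The pgbench custom script then just calls it, e.g. "SELECT xlogtest_insert(100, 1000);" for the 100-byte case, giving the ~100,000 bytes per transaction mentioned above.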
I ran the tests on three different systems: my laptop with 4 cores (8
logical cores with hyperthreading) and an SSD disk, a virtual machine
running on a host with 32 cores (no other VMs running) with some kind of
external storage (I don't know the details), and Nathan Boley's 64-core
AMD box (thanks Nate for lending it again!). On the AMD box, I ran the
tests twice, once with data directory on the disk, and once in /dev/shm.
The results vary. Overall, both git master and the patched
version perform similarly, and at least as well as REL9_3_STABLE. There
are a few clear exceptions to that: in Nathan's box with data directory
on disk, the patched version performs much better than either git or
REL9_3_STABLE, with 20 byte payload. And on my laptop, with 20 byte
payload, git master performs somewhat better than the patched version,
which in turn is still better than REL9_3_STABLE, except when running
with a single client.
I collected the summary graphs of all the tests here (you can click the
graphs for the detailed pgbench-tools result pages):
http://hlinnaka.iki.fi/xlogslot-to-lwlock-results/
Some caveats on the test methodology:
1. I didn't run the tests for the same duration on all the different
machines. The test duration on the AMD box was 20 seconds for the
disk-based tests and 30 seconds for the RAM-disk based tests. On the
32-core VM and my laptop, the test duration was 60 seconds. So you
cannot compare the tests on different hardware directly.
2. All those test durations were pretty short. That means that the TPS
number in any individual test result is quite noisy, and you should look
at the general shapes of the graphs instead of individual points.
3. The number of checkpoints during the tests varied, which again
creates a lot of noise in the individual points.
4. In the last test in the series, on the 64-core AMD box with data dir
in RAM drive, the patched test with 64 clients deadlocked. I tracked it
down to a bug in the patch, in how the insertion's progress is reported
when holding the WALInsertLocks in exclusive mode, i.e. when holding
them all, as is done when starting a checkpoint. The exclusive lock is
held so seldom
that I have no reason to believe that it affects the performance, but
nevertheless the patch I tested was not 100% identical to the fixed
version attached. That explains the apparent dip in performance with 64
clients with the patched version.
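For the curious, the problematic code path is, schematically, something like the sketch below, using the LWLockWait/LWLockWakeup primitives described in the quoted mail. All names here (including the signature of LWLockWakeup) are assumptions; see the attached patch for the actual fix. The subtlety is that when all the insertion locks are held exclusively, progress still has to be advertised on a lock that waiters will actually look at:

/*
 * Sketch only: advertise how far this backend's insertion has
 * progressed, waking up anyone waiting for WAL to reach that point,
 * without releasing the lock.
 */
static void
WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
{
    if (holdingAllLocks)
    {
        /*
         * When holding all locks exclusively (checkpoint), one
         * designated lock (here the last) marks the actual position;
         * getting this case wrong is what caused the deadlock above.
         */
        LWLockWakeup(&WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.lock,
                     insertingAt);
    }
    else
        LWLockWakeup(&WALInsertLocks[MyLockNo].l.lock, insertingAt);
}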
So there are some unexplained differences there, but based on these
results, I'm still OK with committing the patch.
- Heikki
Attachment: xlogslot-to-lwlock-3.patch (text/x-diff, 52.2 KB)