Re: Scaling shared buffer eviction

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-09-23 20:29:16
Message-ID: CA+TgmoY58dQi8Z=FDAu4ggxHV-HYV03-R9on1LSP9OJU_fy_zA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> But this gets at another point: the way we're benchmarking this right
> now, we're really conflating the effects of three different things:
>
> 1. Changing the locking regimen around the freelist and clocksweep.
> 2. Adding a bgreclaimer process.
> 3. Raising the number of buffer locking partitions.

I did some more experimentation on this. Attached is a patch that
JUST does #1 and, as previously suggested, uses a single spinlock
instead of two that would probably land in the same cache line.
Without this patch, on a 32-client, read-only pgbench at scale factor
1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about
LWLock-inspired preemption:

- LWLockAcquire
+ 68.41% ReadBuffer_common
+ 31.59% StrategyGetBuffer

With the patch, unsurprisingly, StrategyGetBuffer disappears and the
only lwlocks that are causing context-switches are the individual
buffer locks. But are we suffering spinlock contention instead as a
result? Yes, but not much. s_lock is at 0.41% in the corresponding
profile, and only 6.83% of those calls are from the patched
StrategyGetBuffer. In a similar profile of master, s_lock is at
0.31%. So there's a bit of additional s_lock contention, but I think
it's considerably less than the contention over the lwlock it's
replacing, because the spinlock is only held for the minimal amount of
time needed, whereas the lwlock could be held across taking and
releasing many individual buffer locks.
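
To make the shape of that concrete, here is a rough sketch of the new
pattern. This is illustrative code of mine, not an excerpt from the
attached patch, though the names (StrategyControl, nextVictimBuffer,
LockBufHdr) are meant to match the existing code:

    /*
     * Sketch only.  The spinlock covers just the clock-hand advance; the
     * buffer's refcount/usage_count are then examined under that buffer's
     * own header spinlock, with no strategy-wide lock held.
     */
    for (;;)
    {
        volatile BufferDesc *buf;
        int         next;

        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
        next = StrategyControl->nextVictimBuffer;
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
        {
            StrategyControl->nextVictimBuffer = 0;
            StrategyControl->completePasses++;
        }
        SpinLockRelease(&StrategyControl->buffer_strategy_lock);

        buf = &BufferDescriptors[next];
        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;         /* victim found; header lock still held */
        if (buf->usage_count > 0)
            buf->usage_count--;
        UnlockBufHdr(buf);
    }

BufFreelistLock, by contrast, stays held while the sweep visits buffer
after buffer, so every backend hunting for a victim serializes behind it.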

TPS results are a little higher with the patch; these are alternating
5-minute runs:

master tps = 176010.647944 (including connections establishing)
master tps = 176615.291149 (including connections establishing)
master tps = 175648.370487 (including connections establishing)
reduce-replacement-locking tps = 177888.734320 (including connections establishing)
reduce-replacement-locking tps = 177797.842410 (including connections establishing)
reduce-replacement-locking tps = 177894.822656 (including connections establishing)

The picture is similar at 64 clients, but the benefit is a little larger:

master tps = 179037.231597 (including connections establishing)
master tps = 180500.937068 (including connections establishing)
master tps = 181565.706514 (including connections establishing)
reduce-replacement-locking tps = 185741.503425 (including connections establishing)
reduce-replacement-locking tps = 188598.728062 (including connections establishing)
reduce-replacement-locking tps = 187340.977277 (including connections establishing)

What's interesting is that I can't see any real sign in the perf
output that the buffer mapping locks are slowing things down, but they
clearly are, because when I take this patch and also boost
NUM_BUFFER_PARTITIONS to 128, the performance goes up (a sketch of how
the partition mapping works follows the numbers below):

reduce-replacement-locking-128 tps = 251001.812843 (including connections establishing)
reduce-replacement-locking-128 tps = 247368.925927 (including connections establishing)
reduce-replacement-locking-128 tps = 250775.304177 (including connections establishing)

The performance also goes up if I do that on master, but the effect is
substantially less:

master-128 tps = 219301.492902 (including connections establishing)
master-128 tps = 219786.249076 (including connections establishing)
master-128 tps = 219821.220271 (including connections establishing)
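
For reference, the locks at issue are chosen by hashing the buffer tag
and reducing the hash to a partition number; this is a paraphrase of
the existing definitions for illustration, with the constant shown as
the -128 builds have it:

    /* Paraphrase of buf_internals.h/lwlock.h, for illustration. */
    #define NUM_BUFFER_PARTITIONS  128   /* 16 by default; 128 in the runs above */

    #define BufTableHashPartition(hashcode) \
        ((hashcode) % NUM_BUFFER_PARTITIONS)
    #define BufMappingPartitionLock(hashcode) \
        (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
                          BufTableHashPartition(hashcode)].lock)

More partitions means lookups of unrelated buffers are spread across
more LWLocks, which is presumably why raising the count helps so much
more once BufFreelistLock is out of the way.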

I think this shows pretty clearly that, even without the bgreclaimer,
there's a lot of merit in getting rid of BufFreelistLock and using a
spinlock held for the absolutely minimal number of instructions
instead. There's already some benefit without doing anything about
the buffer mapping locks, and we'll get a lot more benefit once that
issue is addressed. I think we need to do some serious study to see
whether bgreclaimer is even necessary, because it looks like this
change alone, which is much simpler, takes a huge bite out of the
scalability problem.

I welcome further testing and comments, but my current inclination is
to go ahead and push the attached patch. To my knowledge, nobody has
at any point objected to this aspect of what's being proposed, and it
looks safe to me and seems to be a clear win. We can then deal with
the other issues on their own merits.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: reduce-replacement-locking.patch (text/x-patch, 11.6 KB)
