From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Move unused buffers to freelist
Date: 2013-06-27 13:01:46
Message-ID: 20130627130146.GH1254@alap2.anarazel.de
Lists: pgsql-hackers
On 2013-06-27 08:23:31 -0400, Robert Haas wrote:
> I'd like to just back up a minute here and talk about the broader
> picture.
Sounds like a very good plan.
> So in other words,
> there's no huge *performance* problem for a working set larger than
> shared_buffers, but there is a huge *scalability* problem. Now why is
> that?
> As far as I can tell, the answer is that we've got a scalability
> problem around BufFreelistLock.
Part of the problem is its name ;)
> Contention on the buffer mapping
> locks may also be a problem, but all of my previous benchmarking (with
> LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
> in the room.
Contention-wise I agree. What I have seen is a huge amount of cacheline
bouncing around the buffer header spinlocks.
> My interest in having the background writer add buffers
> to the free list is basically around solving that problem. It's a
> pretty dramatic problem, as the graph above shows, and this patch
> doesn't solve it.
> One thing that occurred to me while writing this note is that the
> background writer doesn't have any compelling reason to run on a
> read-only workload. It will still run at a certain minimum rate, so
> that it cycles the buffer pool every 2 minutes, if I remember
> correctly.
I have previously added some ad-hoc instrumentation that printed the
number of buffers that were required (by other backends) during a
bgwriter cycle and the number of buffers that the buffer manager could
actually write out. I don't think I found any workload where the
bgwriter wrote out a relevant percentage of the necessary pages.
That would explain why the patch doesn't show a big benefit: the
freelist is empty most of the time, so we don't gain anything from the
reduced amount of work done under the lock.
I think the whole algorithm that guides how much the background writer
actually does, including its pacing/sleeping logic, needs to be
rewritten from scratch before we are able to measure the benefit from
this patch. I personally don't think there's much to salvage from the
current code; a sketch of the direction I have in mind follows after
the list below.
Problems with the current code:
* it doesn't manipulate the usage_count and never does anything to used
pages, which means it will just about never find a victim buffer in a
busy database.
* it is not nearly aggressive enough; it only touches a few buffers
ahead of the clock sweep.
* it does not advance the clock sweep, so the individual backends will
touch the same buffers again and bounce the buffer spinlock cachelines
around.
* its adaptation logic is so sluggish that it takes several minutes to
adapt.
* ...
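To make that more concrete, here is roughly the direction I have in
mind for the bgwriter's main loop. This is untested pseudo-C;
ClockSweepNextVictim(), FlushOneBuffer(), StrategyPutOnFreelist(),
StrategyFreelistLen() and freelist_target are made-up names, and
pinning and pacing details are elided:

    /*
     * Untested sketch: run the clock sweep on behalf of the backends,
     * decrement usage counts, write out dirty victims, and push clean
     * ones onto the freelist so StrategyGetBuffer() rarely has to
     * sweep itself.
     */
    while (StrategyFreelistLen() < freelist_target)
    {
        volatile BufferDesc *buf;

        /* advance the *shared* clock sweep, not a private lookahead */
        buf = ClockSweepNextVictim();

        LockBufHdr(buf);
        if (buf->refcount != 0)
        {
            /* pinned, leave it alone */
            UnlockBufHdr(buf);
            continue;
        }
        if (buf->usage_count > 0)
        {
            /* actually do something to used pages */
            buf->usage_count--;
            UnlockBufHdr(buf);
            continue;
        }
        if (buf->flags & BM_DIRTY)
        {
            UnlockBufHdr(buf);
            /* write it out (pinning etc. elided) so backends don't have to */
            FlushOneBuffer(buf);
            continue;
        }
        UnlockBufHdr(buf);

        /* clean, unpinned, usage_count == 0: hand it to the backends */
        StrategyPutOnFreelist(buf);
    }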
There's another thing we could do to noticeably improve scalability of
buffer acquisition. Currently we do a huge amount of work under the
freelist lock.
In StrategyGetBuffer:
    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
    ...
    /* check freelist, will usually be empty */
    ...
    for (;;)
    {
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        ++StrategyControl->nextVictimBuffer;

        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                buf->usage_count--;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        UnlockBufHdr(buf);
    }
So, we perform the entire clock sweep, until we find a single usable
buffer, inside a *global* lock. At times we need to iterate over all of
shared_buffers up to BM_MAX_USAGE_COUNT (5) times until we have pushed
the usage counts down far enough (if the database is busy it can take
even longer...). To put a rough number on it: with shared_buffers =
8GB, i.e. about a million 8KB buffers, that is on the order of five
million buffer header lock/unlock cycles under one global lock in the
worst case.
In a busy database, where usually all the usage counts are high, the
next backend will touch a lot of those buffers again, which causes
massive cacheline eviction & bouncing.
It seems far more sensible to only protect the clock sweep's
nextVictimBuffer with a spinlock. With some care all the rest can happen
without any global interlock.
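As an untested sketch, here is the loop from above rewritten so that
the global critical section is reduced to advancing nextVictimBuffer
(victim_lck is a made-up slock_t that would live in StrategyControl):

    for (;;)
    {
        volatile BufferDesc *buf;
        int         victim;

        /* the global section shrinks to two memory operations */
        SpinLockAcquire(&StrategyControl->victim_lck);
        victim = StrategyControl->nextVictimBuffer;
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;
        SpinLockRelease(&StrategyControl->victim_lck);

        buf = &BufferDescriptors[victim];

        /* the expensive part now only takes the per-buffer lock */
        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                buf->usage_count--;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        UnlockBufHdr(buf);
    }

The freelist check would need similar treatment (e.g. its own
spinlock), but that path is short anyway.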
I think even after fixing this - which we definitely should do - having
a sensible/more aggressive bgwriter moving pages onto the freelist makes
sense, because backends then don't need to deal with dirty pages.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services