| From: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
|---|---|
| To: | Merlin Moncure <mmoncure(at)gmail(dot)com> | 
| Cc: | Sergey Koposov <koposov(at)ast(dot)cam(dot)ac(dot)uk>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org, Stephen Frost <sfrost(at)snowman(dot)net> | 
| Subject: | Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile | 
| Date: | 2012-05-31 18:50:36 | 
| Message-ID: | CA+TgmoZYPeYHWAUeJVYy9A5aNDoULcF33WTnprfR9SYcw30vAg@mail.gmail.com | 
| Lists: | pgsql-hackers | 
On Thu, May 31, 2012 at 2:03 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, May 31, 2012 at 11:54 AM, Sergey Koposov <koposov(at)ast(dot)cam(dot)ac(dot)uk> wrote:
>> On Thu, 31 May 2012, Robert Haas wrote:
>>
>>> Oh, ho.  So from this we can see that the problem is that we're
>>> getting huge amounts of spinlock contention when pinning and unpinning
>>> index pages.
>>>
>>> It would be nice to have a self-contained reproducible test case for
>>> this, so that we could experiment with it on other systems.
>>
>>
>> I created it a few days ago:
>> http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php
>>
>> It is still valid, and it is exactly what I'm using to test. The only
>> change needed is to create a two-column index and drop the other index.
>> The scripts are precisely the ones I'm using now.
>>
>> The problem is that in order to see a really big slowdown (10 times slower
>> than a single thread) I've had to raise the buffers to 48g but it was slow
>> for smaller shared buffer settings as well.
>>
>> But I'm not sure how sensitive the test is to the hardware.
>
> It's not: high contention on spinlocks is going to suck no matter what
> hardware you have.   I think the problem is pretty obvious now: any
> case where multiple backends are scanning the same sequence of buffers
> in a very tight loop is going to display this behavior.  It doesn't
> come up that often: it takes a pretty unusual sequence of events to
> get a bunch of backends hitting the same buffer like that.
>
> Hm, I wonder if you could alleviate the symptoms by making the
> Pin/UnpinBuffer smarter so that frequently pinned buffers could stay
> pinned longer -- kinda as if your private ref count was hacked to be
> higher in that case.   It would be a complex fix for a narrow issue
> though.

This test case is unusual because it hits a whole series of buffers
very hard.  However, there are other cases where this happens on a
single buffer that is just very, very hot, like the root block of a
btree index, where the pin/unpin overhead hurts us.  I've been
thinking about this problem for a while, but it hasn't made it up to
the top of my priority list, because workloads where pin/unpin is the
dominant cost are still relatively uncommon.  I expect them to get
more common as we fix other problems.

Anyhow, I do have some vague thoughts on how to fix this.  Buffer pins
are a lot like weak relation locks, in that they are a type of lock
that is taken frequently, but rarely conflicts.  And the fast-path
locking in 9.2 provides a demonstration of how to handle this kind of
problem efficiently: making the weak, rarely-conflicting locks
cheaper, at the cost of some additional expense when a conflicting
lock (in this case, a buffer cleanup lock) is taken.  In particular,
each backend has its own area to record weak relation locks, and a
strong relation lock must scan all of those areas and migrate any
locks found there to the main lock table.  I don't think it would be
feasible to adopt exactly this solution for buffer pins, because page
eviction and buffer cleanup locks, while not exactly common, are
common enough that we can't require a scan of N per-backend areas
every time one of those operations occurs.
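
To make the cost asymmetry concrete, here is a toy model of the per-backend scheme described above. It is only a sketch: the names (BackendLockArea, weak_lock, strong_lock_migrate), the array sizes, and the absence of any real locking are all assumptions for illustration, and none of it is the actual 9.2 fast-path code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define N_BACKENDS        4   /* hypothetical backend count */
#define SLOTS_PER_BACKEND 4   /* hypothetical per-backend capacity */

/* One private area per backend; relid 0 means "slot unused". */
typedef struct {
    unsigned relid[SLOTS_PER_BACKEND];
} BackendLockArea;

static BackendLockArea areas[N_BACKENDS];
static unsigned main_table_weak_locks;   /* weak locks moved to the shared table */

/* Weak lock: touches only the caller's own area, whatever the backend count. */
static bool weak_lock(int backend, unsigned relid)
{
    assert(backend >= 0 && backend < N_BACKENDS);
    for (size_t i = 0; i < SLOTS_PER_BACKEND; i++) {
        if (areas[backend].relid[i] == 0) {
            areas[backend].relid[i] = relid;
            return true;
        }
    }
    return false;   /* area full: a real system would fall back to the main table */
}

/* Strong lock: must visit every backend's area and migrate matching entries.
 * Returns the number of weak locks migrated. */
static unsigned strong_lock_migrate(unsigned relid)
{
    unsigned moved = 0;
    for (int b = 0; b < N_BACKENDS; b++) {
        for (size_t i = 0; i < SLOTS_PER_BACKEND; i++) {
            if (areas[b].relid[i] == relid) {
                areas[b].relid[i] = 0;
                main_table_weak_locks++;
                moved++;
            }
        }
    }
    return moved;
}
```

The weak path does a constant amount of work, while the strong path is linear in the number of backends, which is why the scan is acceptable for rare strong locks but not for something as frequent as eviction.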
But, maybe we could have a system of this type that only applies to
the very hottest buffers.  Suppose we introduce two new buffer flags,
BUF_NAILED and BUF_NAIL_REMOVAL.  When we detect excessive contention
on the buffer header spinlock, we set BUF_NAILED.  Once we do that,
the buffer can't be evicted until that flag is removed, and backends
are permitted to record pins in a per-backend area protected by a
per-backend spinlock or lwlock, rather than in the buffer header.
When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL.  At that
point, it's no longer permissible to record new pins in the
per-backend areas, but old ones may still exist.  So then we scan all
the per-backend areas and transfer the pins to the buffer header, or
else just wait until no more exist; then, we clear both BUF_NAILED and
BUF_NAIL_REMOVAL.
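
The un-nail sequence above might be sketched as follows. This is a single-threaded illustration under stated assumptions: ToyBuffer, local_pins, nail_buffer and unnail_buffer are hypothetical names, the flag values are arbitrary, and the per-backend locks and buffer header spinlock that a real version would take during the sweep are omitted.

```c
#include <assert.h>
#include <stdatomic.h>

#define BUF_NAILED        0x1u   /* flag names from the proposal; values assumed */
#define BUF_NAIL_REMOVAL  0x2u
#define N_BACKENDS        4      /* hypothetical backend count */

typedef struct {
    atomic_uint flags;
    unsigned    header_pins;     /* pins recorded in the buffer header */
} ToyBuffer;

/* Per-backend pin counts for this one buffer (hypothetical layout). */
static unsigned local_pins[N_BACKENDS];

static void nail_buffer(ToyBuffer *buf)
{
    atomic_fetch_or(&buf->flags, BUF_NAILED);
}

static void unnail_buffer(ToyBuffer *buf)
{
    /* Step 1: announce removal BEFORE looking at any per-backend area,
     * so no new per-backend pins are permitted from here on. */
    atomic_fetch_or(&buf->flags, BUF_NAIL_REMOVAL);

    /* Step 2: sweep every area and transfer its pins to the header. */
    for (int b = 0; b < N_BACKENDS; b++) {
        buf->header_pins += local_pins[b];
        local_pins[b] = 0;
    }

    /* Step 3: only now clear both flags. */
    atomic_fetch_and(&buf->flags, ~(BUF_NAILED | BUF_NAIL_REMOVAL));
    assert(atomic_load(&buf->flags) == 0);
}
```

The sketch takes the "transfer the pins" option; the alternative of waiting for the per-backend pins to drain would replace step 2 with a loop that re-checks the areas until they are all empty.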
So the pin algorithm looks like this:

read buffer header flags (unlocked)
if ((flags & (BUF_NAILED|BUF_NAIL_REMOVAL)) != BUF_NAILED)
{
    take buffer header spinlock
    record pin in buffer header
    release buffer header spinlock
}
else
{
    take per-backend lwlock
    record pin in per-backend area
    release per-backend lwlock
    read buffer header flags (unlocked)
    if ((flags & (BUF_NAILED|BUF_NAIL_REMOVAL)) != BUF_NAILED)
    {
        take per-backend lwlock
        forget pin in per-backend area
        release per-backend lwlock
        take buffer header spinlock
        record pin in buffer header
        release buffer header spinlock
    }
}
Due to memory ordering effects, we might see the buffer as nailed when
in fact nail removal has already begun (or even completed).  We can
prevent that if (1) the nail removal code sets the nail removal flag
before checking the per-backend areas and (2) the pin code checks the
nail removal flag AFTER checking the per-backend areas.  Since
LWLockRelease is a sequencing point, the above algorithm is consistent
with that scheme; the initial unlocked test of the buffer header flags
is merely a heuristic to avoid extra work in the common case where the
buffer isn't nailed.
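
For anyone who wants to experiment, the pin path and its re-check could be written in C11 roughly as below. Treat it as a sketch under assumptions, not a patch: ToyLock is a trivial test-and-set lock standing in for both the buffer header spinlock and the per-backend lwlock, the Toy* types and flag values are made up, and sequentially consistent atomics stand in for whatever barriers the real lock primitives provide. It follows the pseudocode literally, so it does not handle a sweep that has already migrated the per-backend pin.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BUF_NAILED        0x1u   /* flag names from the proposal; values assumed */
#define BUF_NAIL_REMOVAL  0x2u

/* Hypothetical stand-in for both the header spinlock and the per-backend lwlock. */
typedef struct { atomic_flag held; } ToyLock;
static void toy_lock(ToyLock *l)   { while (atomic_flag_test_and_set(&l->held)) ; }
static void toy_unlock(ToyLock *l) { atomic_flag_clear(&l->held); }

typedef struct {
    atomic_uint flags;
    ToyLock     hdr_lock;
    unsigned    header_pins;     /* pins recorded in the buffer header */
} ToyBuffer;

typedef struct {
    ToyLock  lock;
    unsigned local_pins;         /* pins recorded in this backend's area */
} ToyBackend;

static void toy_buffer_init(ToyBuffer *buf)
{
    atomic_init(&buf->flags, 0);
    atomic_flag_clear(&buf->hdr_lock.held);
    buf->header_pins = 0;
}

static void toy_backend_init(ToyBackend *me)
{
    atomic_flag_clear(&me->lock.held);
    me->local_pins = 0;
}

/* True only when the buffer is nailed and removal has not started. */
static bool nailed_only(unsigned flags)
{
    return (flags & (BUF_NAILED | BUF_NAIL_REMOVAL)) == BUF_NAILED;
}

static void pin_in_header(ToyBuffer *buf)
{
    toy_lock(&buf->hdr_lock);
    buf->header_pins++;
    toy_unlock(&buf->hdr_lock);
}

/* Returns true if the pin ended up in the per-backend area. */
static bool pin_buffer(ToyBuffer *buf, ToyBackend *me)
{
    /* Heuristic unlocked check: common case is "not nailed". */
    if (!nailed_only(atomic_load(&buf->flags))) {
        pin_in_header(buf);
        return false;
    }

    toy_lock(&me->lock);
    me->local_pins++;
    toy_unlock(&me->lock);

    /* Re-check AFTER recording the local pin and releasing the lock. */
    if (!nailed_only(atomic_load(&buf->flags))) {
        toy_lock(&me->lock);
        assert(me->local_pins > 0);   /* assumes no sweep migrated it yet */
        me->local_pins--;
        toy_unlock(&me->lock);
        pin_in_header(buf);
        return false;
    }
    return true;
}
```

The second flag read is the one that matters for correctness; the first only saves the per-backend lock round trip when the buffer is not nailed.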
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company