From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-09-04 11:25:58
Message-ID: CAA4eK1+QEcKVaZRi5ipmCnyubFs7tdsjckLgXk_xV-oTCB6KOA@mail.gmail.com
Lists: pgsql-hackers
On Wed, Sep 3, 2014 at 8:03 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> >> + while (tmp_num_to_free > 0)
> >>
> >> I am not sure it's a good idea for this value to be fixed at loop
> >> start and then just decremented.
> >
> > It is based on the idea what bgwriter does for num_to_scan and
> > calling it once has advantage that we need to take freelist_lck
> > just once.
>
> Right, we shouldn't call it every loop iteration. However, consider
> this scenario: there are no remaining buffers on the list and the high
> watermark is 2000. We add 2000 buffers to the list. But by the time
> we get done, other backends have already done 500 more allocations, so
> now there are only 1500 buffers on the list. If this should occur, we
> should add an additional 500 buffers to the list before we consider
> sleeping. We want bgreclaimer to be able to run continuously if the
> demand for buffers is high enough.
It's not difficult to handle such cases, but it can also have a downside
for cases where the demand from backends is not high.
Consider the above case: if instead of 500 more allocations there are
just 5 more, then bgreclaimer will again have to go through the list and
move 5 buffers, and the same can happen again by the time it moves those
5 buffers. Another point to keep in mind is that in this loop we are
also reducing the usage_count of buffers in case we don't find a buffer
with usage_count = 0. OTOH, if we let bgreclaimer go to sleep after it
has moved the initially identified buffers, then the first backend that
finds the freelist has fallen below the low water mark can wake
bgreclaimer.
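
To make that idea concrete, here is a rough sketch (not the patch's actual
code; the names StrategyCtl, numFreeListBuffers, lowWaterMark and
bgreclaimerLatch are illustrative assumptions) of how a backend taking a
buffer off the freelist could wake bgreclaimer once the list drops below
the low water mark:

#include "postgres.h"
#include "storage/latch.h"
#include "storage/spin.h"

typedef struct HypotheticalStrategyControl
{
	slock_t		freelist_lck;		/* protects the fields below */
	int			firstFreeBuffer;	/* head of freelist, -1 if empty */
	int			numFreeListBuffers;	/* current length of the freelist */
	int			lowWaterMark;		/* wake bgreclaimer below this */
	Latch	   *bgreclaimerLatch;	/* set once bgreclaimer has started */
} HypotheticalStrategyControl;

extern HypotheticalStrategyControl *StrategyCtl;	/* illustrative; in shared memory */

static int
hypothetical_pop_freelist(void)
{
	int			buf_id;
	bool		below_watermark;

	SpinLockAcquire(&StrategyCtl->freelist_lck);
	buf_id = StrategyCtl->firstFreeBuffer;
	if (buf_id >= 0)
	{
		/* unlinking the buffer from the list is omitted in this sketch */
		StrategyCtl->numFreeListBuffers--;
	}
	below_watermark =
		(StrategyCtl->numFreeListBuffers < StrategyCtl->lowWaterMark);
	SpinLockRelease(&StrategyCtl->freelist_lck);

	/* Set the latch outside the spinlock to keep the critical section short. */
	if (below_watermark && StrategyCtl->bgreclaimerLatch != NULL)
		SetLatch(StrategyCtl->bgreclaimerLatch);

	return buf_id;
}

The watermark check happens under freelist_lck, but the latch is set only
after releasing it, so the critical section stays short.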
> >> In freelist.c, it seems like a poor idea to have two spinlocks as
> >> consecutive structure members; they'll be in the same cache line,
> >> leading to false sharing. If we merge them into a single spinlock,
> >> does that hurt performance?
> >
> > I have kept them separate so that backends searching for a buffer
> > in freelist doesn't contend with bgreclaimer (while doing clock sweep)
> > or clock sweep being done by other backends. I think it will be bit
> > tricky to devise a test where this can hurt, however it doesn't seem
> > too bad to have two separate locks in this case.
>
> It's not. But if they are in the same cache line, they will behave
> almost like one lock, because the CPU will lock the entire cache line
> for each atomic op. See Tom's comments upthread.
I think to avoid having them in the same cache line, we might need to
add some padding (at least 72 bytes), as the structure size including both
the spinlocks is 56 bytes on the PPC64 machine and the cache line size is
128 bytes. I have also taken performance data by keeping them further
apart as you suggested upthread and by introducing padding, but the
difference in performance is less than 1.5% (at 64 and 128 client counts),
which might just be run-to-run variation. So to proceed we have the
options below:
a. Use two spinlocks as in the patch, but keep them as far apart as
possible. This might not have an advantage compared to what the patch
currently does, but in future we can add padding to take advantage of it
if possible (currently on PPC64 it doesn't show any noticeable advantage;
however, on some other machine it might). A rough sketch of such padding
follows after this list.
b. Use only one spinlock. This can have a disadvantage in certain cases,
as mentioned upthread; however, those might not be the usual cases, so for
now we can treat them as lower priority and choose this option. Another
point in this regard is that I would have to use a volatile pointer to
prevent code rearrangement in this case.
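
Here is a hedged sketch of the padding idea in option (a), not the patch's
actual layout: the field names and the 128-byte cache line size (as
measured on the PPC64 machine above) are illustrative assumptions, and
slock_t is stubbed as a char so the example compiles standalone.

#include <stddef.h>
#include <stdio.h>

#define CACHE_LINE_SIZE 128		/* cache line size on the PPC64 machine above */

typedef char slock_t;			/* stand-in for PostgreSQL's slock_t */

typedef struct PaddedStrategyControl
{
	slock_t		freelist_lck;	/* protects the freelist fields */
	int			firstFreeBuffer;
	int			lastFreeBuffer;

	/* a full cache line of padding guarantees the next lock is on another line */
	char		pad[CACHE_LINE_SIZE];

	slock_t		victimbuf_lck;	/* protects the clock-sweep position */
	unsigned int nextVictimBuffer;
} PaddedStrategyControl;

int
main(void)
{
	/* the two locks end up more than a cache line apart */
	printf("freelist_lck offset: %zu, victimbuf_lck offset: %zu\n",
		   offsetof(PaddedStrategyControl, freelist_lck),
		   offsetof(PaddedStrategyControl, victimbuf_lck));
	return 0;
}

A full cache line of padding is more than the minimum 72 bytes mentioned
above, but it keeps the two locks apart regardless of how the compiler
lays out the other fields.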
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com