Re: Reducing spinlock acquisition within clock sweep loop

From: Andres Freund <andres(at)anarazel(dot)de>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing spinlock acquisition within clock sweep loop
Date: 2015-04-21 20:25:02
Message-ID: 20150421202502.GM14483@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2015-04-21 14:46:04 -0500, Merlin Moncure wrote:
> The main sources of contention, buffer_strategy_lock, has been removed
> FWICT via 5d7962c6 and d72731a7. However, during the sweep each tick
> locks the buffer header via spinlock in order to to adjust
> usage_count.

FWIW, I think the best approach to achieve this is to make most buffer
header manipulations lockless. It's not trivial but quite possible. I'd
a very rough POC a while back ([1]); that only degenerated to something
spinning for a couple edge cases (!BM_VALID || BM_PIN_COUNT_WAITER
IIRC). That'd not necessarily reduce the number of atomic ops, but
making things lockless makes does away with the problem of sleeping
while holding a spinlock etc.

The primary use case for that is less the clock sweep than things index
root pages and small lookup pages. I've seen several times that the
spinlock can really hurt there.

I do believe that we can't continue with our current clock sweep
approach for much longer. It's already extremely expensive and even if
we fix some of the most glaring problems (as discussed nearby atm), it's
still a horribly cache/runtime inefficient way to do things in many
workloads.

> Right now usage_count ranges across a small number (currently 0-5),
> incrementing upon allocation and decrementing on sweep. Because of
> this, each tick of the clock sweep spinlocks the buffer in preparation
> of A. the buffer being returned or B. the buffer having a non zero
> usage count so that it can be safely decremented. B is potential
> optimization fruit, I think.

> The idea:
> Instead, why not let usage_count rotate around on it's own in a
> clock-like fashion? the idea is that, instead of defining usage count
> as a fixed offset from zero, it is based on the number of times the
> buffer has been touched since the last sweep. In other words, each
> time the buffer clock crosses midnight, 'usage count' for the whole
> pool drops by one, implicitly. Each time a buffer is touched, it's
> usage_count is set to 'completePasses' + 1, but, just like today, is
> never allowed to increment past completePasses + 5. So, what used to
> be usage_count becomes defined as the distance between the usage_count
> and completePasses.
>
> Because of usage_count tracking rotation, calculating what used to be
> usage_count requires a bit of logic to arrive at the right number as
> rotation happens. I'm pretty sure all the rotation edge cases can be
> handled but I'm glossing over them for now.

> Clock sweep buffer acquisition currently looks like this:
> loop:
> tick();
> spinlock();
> in use? goto loop
> usage_count > 0? yes: decrement, goto loop
> return buffer;
>
> Proposed buffer acquisition would look like:
> loop:
> tick();
> victim < last_victim? yes: atomic re-fetch completePasses (see below)
> usage_count > completePasses? yes: goto loop
> try_spinlock(); no lock? goto loop
> usage_count > completePasses? yes: goto loop
> in use? goto loop
> return buffer;
>
> (try_spinlock is an adorned TAS() as a opposed to a full spin). Usage
> count is double checked after lock acquisition in case the local copy
> is stale.

I don't really see what this improves? You could do the "try spinlock"
bit just as well with the old algorithm and I don't really see a
accuracy benefit in the proposed approach? Maybe I'm missing something
entirely?

[1]: The basic trick I have in mind is that by putting flags, usagecount
and refcount into one integer (in separate bits) it is possible to do
CAS loops to manipulate each of those. E.g. to decrement usagecount you
can do something roughly like

do
{
val = pg_atomic_read_u32(desc->flag_and_counts);
if (BufferDescUsageCount(val) == 0)
break;

newval = BufferDescChangeUsageCount(val, -1);
}
while(!pg_atomic_compare_exchange_u32(&desc->flag_and_counts,
&val, newval))

Where BufferDescUsageCount just to some bit masking and shifting to
manipulate the right part. Same for refcount and most flags.

To deal with changing tags I'd introduced a flag value that was used to
say 'you have to spinlock this time'.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-04-21 20:26:08 Re: Replication identifiers, take 4
Previous Message Robert Haas 2015-04-21 20:21:47 Re: Freeze avoidance of very large table.