Re: Proposal of tunable fix for scalability of 8.4

From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 18:37:32
Message-ID: 49B9566C.3010708@sun.com
Lists: pgsql-performance

On 03/12/09 13:48, Scott Carey wrote:
> On 3/11/09 7:47 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> All I'm adding, is that it makes some sense to me based on my
> experience in CPU / RAM bound scalability tuning. It was expressed
> that the test itself didn't even make sense.
>
> I was wrong in my understanding of what the change did. If it wakes
> ALL waiters up there is an indeterminate amount of time a lock will wait.
> However, if instead of waking up all of them it only wakes up the
> shared readers and leaves all the exclusive ones at the front of the
> queue, there is no possibility of starvation since those exclusives
> will be at the front of the line after the wake-up batch.
>
> As for this being a use case that is important:
>
> * SSDs will drive the % of use cases that are not I/O bound up
> significantly over the next couple years. All postgres installations
> with less than about 100GB of data TODAY could avoid being I/O bound
> with current SSD technology, and those less than 2TB can do so as well
> but at high expense or with less proven technology like the ZFS L2ARC
> flash cache.
> * Intel will have a mainstream CPU that handles 12 threads (6 cores,
> 2 threads each) at the end of this year. Mainstream two CPU systems
> will have access to 24 threads and be common in 2010. Higher end 4CPU
> boxes will have access to 48 CPU threads. Hardware thread count is
> only going up. This is the future.
>

SSDs are precisely my motivation for doing RAM-based tests with
PostgreSQL. While I am waiting for my SSDs to arrive, I have started to
emulate them by putting the whole database in RAM, which in a sense is
better than SSDs, so if we can tune with RAM disks then SSDs will be covered.

What we have is a pool of 2000 users. We have each user run a series of
transactions on different rows and see how far the database scales
linearly before some bottleneck (system or database) kicks in and adding
active users no longer gives a linear increase. Often there is a drop
after some number of active users is reached. If all 2000 users scale
linearly, then another test with, say, 2500 can be executed. The whole
point is to find the limit: how far we can go until there are typically
no system resources left to exploit.

The testkit I am using is a lightweight OLTP-like workload that a user
runs against a known schema; between transactions it emulates a wait
time of 200ms. In some sense it emulates a real user who clicks, waits
to see what he got, and then clicks again, which results in another
transaction. (Not exactly, but you get the point.) Like all workloads,
it is generally used to find bottlenecks in systems before putting
production stuff on them.
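
As a rough illustration of why the 200ms wait time matters, here is a minimal sketch of the interactive response time law for a closed workload, X = N / (Z + R), where N is the number of users, Z the think time and R the response time. The 200ms think time comes from the description above; the 50ms response time and the function name are assumptions made up for the example, not measurements from this test.

```python
def closed_loop_throughput(users, think_time_s, response_time_s):
    """Interactive response time law: X = N / (Z + R), in transactions/second."""
    return users / (think_time_s + response_time_s)


THINK_TIME_S = 0.200        # wait time between transactions, from the testkit description
ASSUMED_RESPONSE_S = 0.050  # hypothetical response time, NOT a measured value

if __name__ == "__main__":
    for users in (100, 1000, 2000):
        tps = closed_loop_throughput(users, THINK_TIME_S, ASSUMED_RESPONSE_S)
        print(f"{users:5d} users -> {tps:8.1f} tx/s ({tps * 60:.0f} tpm)")
```

As long as R stays flat, throughput grows linearly with the number of users; once a bottleneck makes R climb, the curve flattens or drops, which is the knee the test is looking for.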

In my current environment I am running such workloads and seeing how
many users I can reach before the system has no more CPU resources
available for linear growth in tpm. Generally, as many of you
mentioned, you will see disk latency, network latency, CPU resource
problems, etc., and that's the work I am doing right now. I am working
around network latency by using a private network and tuning operating
system tunables to improve efficiency there. I am improving disk
latency by putting the database on /RAM (and soon on SSDs). However, if I
still cannot consume all the CPU, then I am probably being hit by locks.
Using the PostgreSQL DTrace probes I can see what's happening.

At low user counts (100 users) the lock profile from a single user's
point of view is as follows:

# dtrace -q -s 84_lwlock.d 1764

Lock Id              Mode       State     Count
ProcArrayLock        Shared     Waiting       1
CLogControlLock      Shared     Acquired      2
ProcArrayLock        Exclusive  Waiting       3
ProcArrayLock        Exclusive  Acquired     24
XidGenLock           Exclusive  Acquired     24
FirstLockMgrLock     Shared     Acquired     25
CLogControlLock      Exclusive  Acquired     26
FirstBufMappingLock  Shared     Acquired     55
WALInsertLock        Exclusive  Acquired     75
ProcArrayLock        Shared     Acquired    178
SInvalReadLock       Shared     Acquired    378

Lock Id              Mode       State     Combined Time (ns)
SInvalReadLock                  Acquired               29849
ProcArrayLock        Shared     Waiting                92261
ProcArrayLock                   Acquired              951470
FirstLockMgrLock     Exclusive  Acquired             1069064
CLogControlLock      Exclusive  Acquired             1295551
ProcArrayLock        Exclusive  Waiting              1758033
FirstBufMappingLock  Exclusive  Acquired             2078507
XidGenLock           Exclusive  Acquired             3460800
WALInsertLock        Exclusive  Acquired            12205466
SInvalReadLock       Exclusive  Acquired            42684236
ProcArrayLock        Exclusive  Acquired            57397139

As users grow beyond 1000, the profile for the same sample user changes
to the following:
# dtrace -q -s 84_lwlock.d 1764

Lock Id              Mode       State     Count
CLogControlLock      Exclusive  Waiting       1
WALInsertLock        Exclusive  Waiting       1
ProcArrayLock        Exclusive  Acquired      7
XidGenLock           Exclusive  Acquired      7
ProcArrayLock        Exclusive  Waiting      10
CLogControlLock      Shared     Acquired     13
WALInsertLock        Exclusive  Acquired     23
CLogControlLock      Exclusive  Acquired     30
ProcArrayLock        Shared     Acquired     50
FirstLockMgrLock     Shared     Acquired    104
SInvalReadLock       Shared     Acquired    105
FirstBufMappingLock  Shared     Acquired    106

Lock Id              Mode       State     Combined Time (ns)
WALInsertLock        Exclusive  Waiting                73990
CLogControlLock      Exclusive  Waiting               383066
XidGenLock           Exclusive  Acquired              408301
CLogControlLock      Exclusive  Acquired             1871642
ProcArrayLock                   Acquired             2825372
WALInsertLock        Exclusive  Acquired             3144580
FirstLockMgrLock     Exclusive  Acquired             3799818
FirstBufMappingLock  Exclusive  Acquired             4083473
SInvalReadLock       Exclusive  Acquired            20611120
ProcArrayLock        Exclusive  Acquired            37920098
ProcArrayLock        Exclusive  Waiting           3783942020

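To put the two time profiles side by side, here is a small sketch that sums the "Waiting" and "Acquired" nanoseconds and reports the share spent waiting. The rows are copied from the two "Combined Time (ns)" listings above; the function and variable names are made up for the example, and the parser simply takes the last two fields of each row so that rows printed without a mode still count.

```python
LOW_LOAD_PROFILE = """\
SInvalReadLock Acquired 29849
ProcArrayLock Shared Waiting 92261
ProcArrayLock Acquired 951470
FirstLockMgrLock Exclusive Acquired 1069064
CLogControlLock Exclusive Acquired 1295551
ProcArrayLock Exclusive Waiting 1758033
FirstBufMappingLock Exclusive Acquired 2078507
XidGenLock Exclusive Acquired 3460800
WALInsertLock Exclusive Acquired 12205466
SInvalReadLock Exclusive Acquired 42684236
ProcArrayLock Exclusive Acquired 57397139
"""

HIGH_LOAD_PROFILE = """\
WALInsertLock Exclusive Waiting 73990
CLogControlLock Exclusive Waiting 383066
XidGenLock Exclusive Acquired 408301
CLogControlLock Exclusive Acquired 1871642
ProcArrayLock Acquired 2825372
WALInsertLock Exclusive Acquired 3144580
FirstLockMgrLock Exclusive Acquired 3799818
FirstBufMappingLock Exclusive Acquired 4083473
SInvalReadLock Exclusive Acquired 20611120
ProcArrayLock Exclusive Acquired 37920098
ProcArrayLock Exclusive Waiting 3783942020
"""


def waiting_share(profile_text):
    """Fraction of the sampled lock time that was spent in the Waiting state."""
    totals = {"Waiting": 0, "Acquired": 0}
    for line in profile_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed row
        state, nanos = fields[-2], fields[-1]
        if state in totals and nanos.isdigit():
            totals[state] += int(nanos)
    return totals["Waiting"] / (totals["Waiting"] + totals["Acquired"])


if __name__ == "__main__":
    print(f"100 users:   {waiting_share(LOW_LOAD_PROFILE):.1%} of lock time waiting")
    print(f"1000+ users: {waiting_share(HIGH_LOAD_PROFILE):.1%} of lock time waiting")
```

Almost all of the waiting time in the high-load sample comes from the single ProcArrayLock Exclusive Waiting row.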
That's similar to what I saw last year, and it is the reason I am
playing with lwlock.c: to see how modifying LWLockRelease() to do
different types of wake-ups affects this top waiting time, which is
basically wasted time from the perspective of the application, the
operating system, and the CPU. All I am saying is that with tuning
flexibility we can actually reduce the time wasted and probably spend
that time in the acquired state, doing some useful work.

I don't think I have misconfigured the system. I am just showing that
there are ways to cut down some inefficiencies here, and showing test
points. I am also showing where it does seem to help performance. It may
not help in all cases, but I just gave you a test where it performs
better than the current behavior.

And again, this is the third time I am saying it: the test users also
have some latency built into them, which is what is generally exploited
to support more users than the number of CPUs on the system, and that's
the point we want to exploit. Otherwise, if all users did their work
with no latency, we would need 6+ billion CPUs to handle all possible
users. Typically as an administrator (system and database) I can only
tweak/control latencies within my domain, that is network, disk, CPUs,
etc. Those are what I am tweaking to arrive at a *Configured*
environment, and now I am trying to reduce lock contention/waits in
PostgreSQL so that we have an optimized setup.

I am trying another run where I limit the number of woken-up threads to
a pre-configured number, to see how various values pan out in terms of
throughput on this server.
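
For readers following the thread, here is a toy model of the wake-up choices being discussed. It is a sketch only, not the patch and not the code in lwlock.c. The wait queue is a plain list of 'S' (shared) and 'E' (exclusive) waiters in arrival order, and the policy names are made up for the example. "front_run" is an assumption about the stock behaviour (wake a lone exclusive waiter at the head, otherwise the leading run of shared waiters) and should be checked against the source; "shared_only" follows the suggestion in the quoted text; "capped" is the pre-configured limit mentioned above.

```python
def pick_waiters(queue, policy, cap=None):
    """Split a FIFO wait queue into (woken, still_waiting) under a wake-up policy.

    queue is a list of 'S' (shared) and 'E' (exclusive) waiters, head first.
    All policy names are hypothetical labels for this illustration.
    """
    if not queue:
        return [], []

    if policy == "front_run":
        # Assumed stock behaviour: an exclusive waiter at the head is woken
        # alone; otherwise wake the consecutive shared waiters at the head.
        if queue[0] == "E":
            count = 1
        else:
            count = 0
            while count < len(queue) and queue[count] == "S":
                count += 1
        return queue[:count], queue[count:]

    if policy == "wake_all":
        # Wake every waiter and let them race for the lock.
        return list(queue), []

    if policy == "shared_only":
        # Wake every shared waiter; exclusives keep their relative order
        # and end up at the front of the remaining queue.
        woken = [w for w in queue if w == "S"]
        remaining = [w for w in queue if w == "E"]
        return woken, remaining

    if policy == "capped":
        # Wake at most `cap` waiters from the head, whatever their mode.
        if cap is None or cap < 1:
            raise ValueError("capped policy needs cap >= 1")
        return queue[:cap], queue[cap:]

    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    sample = ["S", "S", "E", "S", "E"]
    for name, kwargs in (("front_run", {}), ("wake_all", {}),
                         ("shared_only", {}), ("capped", {"cap": 3})):
        woken, remaining = pick_waiters(sample, name, **kwargs)
        print(f"{name:12s} woken={woken} remaining={remaining}")
```

The model ignores what happens after the wake-up (woken backends still have to win the lock), so it only shows which waiters each policy releases, not the resulting throughput.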

Regards,
Jignesh
