From: | "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
Subject: | Re: User concurrency thresholding: where do I look? |
Date: | 2007-07-26 15:27:30 |
Message-ID: | 46A8BD62.2080904@sun.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
I will look for runs with longer samples..
Now the script could have mislabeled lock names.. Anyway digging into
the one that seems to increase over time... I did stack profiles on how
that increases... and here are some numbers..
For 600-850 Users: that potential mislabeled CheckPointStartLock or
LockID==12 comes from various sources where the top source (while system
is still doing great) comes from:
postgres`LWLockAcquire+0x1c8
postgres`SimpleLruReadPage_ReadOnly+0xc
postgres`TransactionIdGetStatus+0x14
postgres`TransactionLogFetch+0x58
postgres`TransactionIdDidCommit+0x4
postgres`HeapTupleSatisfiesSnapshot+0x234
postgres`heap_release_fetch+0x1a8
postgres`index_getnext+0xf4
postgres`IndexNext+0x7c
postgres`ExecScan+0x8c
postgres`ExecProcNode+0xb4
postgres`ExecutePlan+0xdc
postgres`ExecutorRun+0xb0
postgres`PortalRunSelect+0x9c
postgres`PortalRun+0x244
postgres`exec_execute_message+0x3a0
postgres`PostgresMain+0x1300
postgres`BackendRun+0x278
postgres`ServerLoop+0x63c
postgres`PostmasterMain+0xc40
8202100
postgres`LWLockAcquire+0x1c8
postgres`TransactionIdSetStatus+0x1c
postgres`RecordTransactionCommit+0x2a8
postgres`CommitTransaction+0xc8
postgres`CommitTransactionCommand+0x90
postgres`finish_xact_command+0x60
postgres`exec_execute_message+0x3d8
postgres`PostgresMain+0x1300
postgres`BackendRun+0x278
postgres`ServerLoop+0x63c
postgres`PostmasterMain+0xc40
postgres`main+0x394
postgres`_start+0x108
30822100
However at 900 Users where the big drop in throughput occurs:
It gives a different top "consumer" of time:
postgres`LWLockAcquire+0x1c8
postgres`TransactionIdSetStatus+0x1c
postgres`RecordTransactionCommit+0x2a8
postgres`CommitTransaction+0xc8
postgres`CommitTransactionCommand+0x90
postgres`finish_xact_command+0x60
postgres`exec_execute_message+0x3d8
postgres`PostgresMain+0x1300
postgres`BackendRun+0x278
postgres`ServerLoop+0x63c
postgres`PostmasterMain+0xc40
postgres`main+0x394
postgres`_start+0x108
406601300
postgres`LWLockAcquire+0x1c8
postgres`SimpleLruReadPage+0x1ac
postgres`TransactionIdGetStatus+0x14
postgres`TransactionLogFetch+0x58
postgres`TransactionIdDidCommit+0x4
postgres`HeapTupleSatisfiesUpdate+0x360
postgres`heap_lock_tuple+0x27c
postgres`ExecutePlan+0x33c
postgres`ExecutorRun+0xb0
postgres`PortalRunSelect+0x9c
postgres`PortalRun+0x244
postgres`exec_execute_message+0x3a0
postgres`PostgresMain+0x1300
postgres`BackendRun+0x278
postgres`ServerLoop+0x63c
postgres`PostmasterMain+0xc40
postgres`main+0x394
postgres`_start+0x108
444523100
postgres`LWLockAcquire+0x1c8
postgres`SimpleLruReadPage+0x1ac
postgres`TransactionIdGetStatus+0x14
postgres`TransactionLogFetch+0x58
postgres`TransactionIdDidCommit+0x4
postgres`HeapTupleSatisfiesSnapshot+0x234
postgres`heap_release_fetch+0x1a8
postgres`index_getnext+0xf4
postgres`IndexNext+0x7c
postgres`ExecScan+0x8c
postgres`ExecProcNode+0xb4
postgres`ExecutePlan+0xdc
postgres`ExecutorRun+0xb0
postgres`PortalRunSelect+0x9c
postgres`PortalRun+0x244
postgres`exec_execute_message+0x3a0
postgres`PostgresMain+0x1300
postgres`BackendRun+0x278
postgres`ServerLoop+0x63c
postgres`PostmasterMain+0xc40
1661300000
Maybe you all will understand more than I do on what it does here..
Looks like IndexNext has a problem at high number of users to me.. but I
could be wrong..
-Jignesh
Tom Lane wrote:
> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:
>
>> The count is only for a 10-second snapshot.. Plus remember there are
>> about 1000 users running so the connection being profiled only gets
>> 0.01 of the period on CPU.. And the count is for that CONNECTION only.
>>
>
> OK, that explains the low absolute levels of the counts, but if the
> counts are for a regular backend then there's still a lot of bogosity
> here. Backends won't be taking the CheckpointLock at all, nor do they
> take CheckpointStartLock in exclusive mode. The bgwriter would do that
> but it'd not be taking most of these other locks. So I think the script
> is mislabeling the locks somehow.
>
> Also, elementary statistics should tell you that a sample taken as above
> is going to have enormous amounts of noise. You should be sampling over
> a much longer period, say on the order of a minute of runtime, to have
> numbers that are trustworthy.
>
> regards, tom lane
>
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2007-07-26 15:42:48 | Re: User concurrency thresholding: where do I look? |
Previous Message | Simon Riggs | 2007-07-26 15:25:31 | Re: User concurrency thresholding: where do I look? |