Re: Re: recently and selectively slow, but very simple, update query....

From: Stelios Mavromichalis <mstelios(at)cytech(dot)gr>
To: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Re: recently and selectively slow, but very simple, update query....
Date: 2014-05-07 15:57:50
Message-ID: CAOfbx2Z8r-YM13-K8HZknVfGT9qHqNhgNOq2SCXZHWRV1tLx=g@mail.gmail.com
Lists: pgsql-performance

hello everyone,

i feel like i owe you all an apology.

it _was_ a locking problem after all.

i don't know why i was unlucky enough never to catch it for so long, no
matter how hard i tried and how many times i looked.

eventually, after many hours on the problem, and after hitting the locks
dependency query (the last one from
<http://wiki.postgresql.org/wiki/Lock_Monitoring>) many times for no
apparent reason (or so it seemed at the time), it finally revealed the
heart of my problem.

there was another query updating a different column (opt_pushdlr_ltstamp)
on that particular row (usid = 10808). however, due to badly written code
(it didn't addBatch and execute everything at once; instead it updated
immediately and delayed the commit while waiting for _slow_ external
servers to reply) and external events, that update didn't commit for long
periods of time, forcing the original balance update to wait for it.
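
roughly, the two sessions looked like this (a minimal sketch only -- the
"subscribers" table name and the exact balance arithmetic are made up for
illustration; only usid and opt_pushdlr_ltstamp are the real column names):

-- session A: the badly written code path, updating the timestamp column
begin;
update subscribers set opt_pushdlr_ltstamp = now() where usid = 10808;
-- ... the application then waits for a slow external server to reply
-- before committing, holding the row lock on usid = 10808 all that time ...

-- session B: the balance update inside easysms_jb_pay() (simplified)
update subscribers set balance = balance - 2 where usid = 10808;
-- blocks here until session A commits or rolls back, which is exactly the
-- multi-second duration that showed up in the slow query log

-- session A, eventually:
commit;
-- only now does session B's update proceed

the fix was on the application side: commit the timestamp update right
away (or batch those updates with addBatch and execute them together)
instead of keeping the transaction open across the slow external call.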

so, thanks to your advice (that a dump/restore wouldn't help me), i didn't
bring the service down for a dump/restore cycle that wasn't required after
all, saving me downtime.

i can't thank you enough for leading me to solve this problem and helping
me gain a deeper understanding of postgres.

i wish you all a great day.

my apologies and best regards,

/mstelios

Stelios Mavromichalis
Cytech Ltd. - http://www.cytech.gr/
Science & Technology Park of Crete
fax: +30 2810 31 1045
tel.: +30 2810 31 4127
mob.: +30 697 7078013
skype: mstelios

On Tue, May 6, 2014 at 4:31 PM, Stelios Mavromichalis <mstelios(at)cytech(dot)gr> wrote:

> hello again,
>
> some more data on the subject:
>
> you are right, delete/re-insert didn't solve the problem (haven't yet
> tried dump/restore, i might tonight, when low traffic).
>
> a short snip from the logs, where i log all queries that take longer than 1 sec:
>
> 2014-05-06T15:57:46.303880+03:00 s10 postgres[46220]: [1891-1] prosms
> prosms LOG: 00000: duration: 1947.172 ms execute <unnamed>: select
> easysms_jb_pay($1,$2)
> 2014-05-06T15:57:46.304230+03:00 s10 postgres[46220]: [1891-2] prosms
> prosms DETAIL: parameters: $1 = '10808', $2 = '2'
> 2014-05-06T15:57:46.304439+03:00 s10 postgres[46220]: [1891-3] prosms
> prosms LOCATION: exec_execute_message, postgres.c:1989
> 2014-05-06T15:57:56.199005+03:00 s10 postgres[58002]: [2570-1] prosms
> prosms LOG: 00000: duration: 6869.886 ms execute <unnamed>: select
> easysms_jb_pay($1,$2)
> 2014-05-06T15:57:56.199349+03:00 s10 postgres[58002]: [2570-2] prosms
> prosms DETAIL: parameters: $1 = '10808', $2 = '2'
> 2014-05-06T15:57:56.199567+03:00 s10 postgres[58002]: [2570-3] prosms
> prosms LOCATION: exec_execute_message, postgres.c:1989
> 2014-05-06T15:57:59.134982+03:00 s10 postgres[58002]: [2571-1] prosms
> prosms LOG: 00000: duration: 1797.747 ms execute <unnamed>: select
> easysms_jb_pay($1,$2)
> 2014-05-06T15:57:59.135334+03:00 s10 postgres[58002]: [2571-2] prosms
> prosms DETAIL: parameters: $1 = '10808', $2 = '2'
> 2014-05-06T15:57:59.135562+03:00 s10 postgres[58002]: [2571-3] prosms
> prosms LOCATION: exec_execute_message, postgres.c:1989
> 2014-05-06T15:58:07.149477+03:00 s10 postgres[46220]: [1892-1] prosms
> prosms LOG: 00000: duration: 3938.979 ms execute <unnamed>: select
> easysms_jb_pay($1,$2)
> 2014-05-06T15:58:07.149830+03:00 s10 postgres[46220]: [1892-2] prosms
> prosms DETAIL: parameters: $1 = '10808', $2 = '2'
> 2014-05-06T15:58:07.150067+03:00 s10 postgres[46220]: [1892-3] prosms
> prosms LOCATION: exec_execute_message, postgres.c:1989
> 2014-05-06T16:01:33.784422+03:00 s10 postgres[58002]: [2572-1] prosms
> prosms LOG: 00000: duration: 2921.212 ms execute <unnamed>: select
> easysms_jb_pay($1,$2)
> 2014-05-06T16:01:33.784842+03:00 s10 postgres[58002]: [2572-2] prosms
> prosms DETAIL: parameters: $1 = '10808', $2 = '4'
> 2014-05-06T16:01:33.785037+03:00 s10 postgres[58002]: [2572-3] prosms
> prosms LOCATION: exec_execute_message, postgres.c:1989
>
>
> should you deem it helpful, yes, i can enable full query logging and send
> you the listing.
>
> a typical vmstat:
> # vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 1 0 363476 325852 294064 30520736 0 0 26 347 0 0 1 0 99 0 0
> 1 0 363476 328400 294068 30520732 0 0 0 28 4930 8014 6 1 94 0 0
> 2 0 363476 331756 294068 30520732 0 0 0 4384 4950 7980 6 1 93 0 0
> 0 0 363476 334016 294068 30520756 0 0 0 4384 4961 7981 7 1 92 0 0
> 0 0 363476 334700 294068 30520756 0 0 0 4424 4012 6467 4 1 95 0 0
> 1 0 363476 330852 294068 30520788 0 0 0 0 2559 3861 5 1 95 0 0
> 1 0 363476 331316 294072 30520788 0 0 0 4408 5013 8127 6 1 94 0 0
> 1 0 363476 330788 294072 30520788 0 0 0 4384 5535 9055 6 1 93 0 0
> 0 0 363476 331496 294072 30520804 0 0 0 4384 5031 8092 7 1 92 0 0
> 2 0 363476 331268 294072 30520804 0 0 0 4428 5052 8246 6 1 94 0 0
> 1 0 363476 330848 294080 30520812 0 0 0 32 4892 7950 5 1 94 0 0
> 1 0 363476 330480 294080 30520812 0 0 0 4388 4935 8036 6 1 94 0 0
> 2 0 363476 332616 294084 30521092 0 0 0 4408 5064 8194 6 1 93 0 0
> 0 0 363476 333596 294084 30521008 0 0 0 4384 5205 8463 8 1 91 0 0
> 1 0 363476 333324 294084 30521008 0 0 0 40 5014 8151 6 1 94 0 0
> 0 0 363476 332740 294084 30521016 0 0 0 4384 5047 8163 6 1 93 0 0
> 1 0 363476 336052 294084 30520888 0 0 0 4384 4849 7780 6 1 94 0 0
> 1 0 363476 334732 294088 30520892 0 0 8 4400 5520 9012 6 1 93 0 0
> 0 0 363476 334064 294096 30520884 0 0 0 220 3903 6193 6 1 94 0 0
> 0 0 363476 333124 294096 30520916 0 0 0 2232 4088 6462 6 1 93 0 0
>
> the process that is constantly writing the majority of data is:
>
> "postgres: stats collector process"
>
> and its write rate varies from 2.5 MB/sec up to 5.7 MB/sec.
>
> all the other postgres (and non-postgres) processes write very little
> data, and only rarely.
>
> the checkpointer process is at about 78.6 kB/sec (for a few seconds now as
> i write this email; no other process shows a constant rate of I/O).
>
> also, _while having the problem_, here are the results of the following
> queries (taken from http://wiki.postgresql.org/wiki/Lock_Monitoring):
>
> SELECT relation::regclass, * FROM pg_locks WHERE NOT granted;
>
> (0 rows)
>
>
> SELECT bl.pid AS blocked_pid,
> a.usename AS blocked_user,
> ka.query AS blocking_statement,
> now() - ka.query_start AS blocking_duration,
> kl.pid AS blocking_pid,
> ka.usename AS blocking_user,
> a.query AS blocked_statement,
> now() - a.query_start AS blocked_duration
> FROM pg_catalog.pg_locks bl
> JOIN pg_catalog.pg_stat_activity a ON a.pid = bl.pid
> JOIN pg_catalog.pg_locks kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
> JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
> WHERE NOT bl.granted;
> (0 rows)
>
> SELECT
>   waiting.locktype AS waiting_locktype,
>   waiting.relation::regclass AS waiting_table,
>   waiting_stm.query AS waiting_query,
>   waiting.mode AS waiting_mode,
>   waiting.pid AS waiting_pid,
>   other.locktype AS other_locktype,
>   other.relation::regclass AS other_table,
>   other_stm.query AS other_query,
>   other.mode AS other_mode,
>   other.pid AS other_pid,
>   other.granted AS other_granted
> FROM
>   pg_catalog.pg_locks AS waiting
> JOIN
>   pg_catalog.pg_stat_activity AS waiting_stm
>   ON (
>     waiting_stm.pid = waiting.pid
>   )
> JOIN
>   pg_catalog.pg_locks AS other
>   ON (
>     (
>       waiting."database" = other."database"
>       AND waiting.relation = other.relation
>     )
>     OR waiting.transactionid = other.transactionid
>   )
> JOIN
>   pg_catalog.pg_stat_activity AS other_stm
>   ON (
>     other_stm.pid = other.pid
>   )
> WHERE
>   NOT waiting.granted
>   AND waiting.pid <> other.pid;
> (0 rows)
>
>
>
> WITH RECURSIVE
> c(requested, current) AS
> ( VALUES
> ('AccessShareLock'::text, 'AccessExclusiveLock'::text),
> ('RowShareLock'::text, 'ExclusiveLock'::text),
> ('RowShareLock'::text, 'AccessExclusiveLock'::text),
> ('RowExclusiveLock'::text, 'ShareLock'::text),
> ('RowExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
> ('RowExclusiveLock'::text, 'ExclusiveLock'::text),
> ('RowExclusiveLock'::text, 'AccessExclusiveLock'::text),
> ('ShareUpdateExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),
> ('ShareUpdateExclusiveLock'::text, 'ShareLock'::text),
> ('ShareUpdateExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
> ('ShareUpdateExclusiveLock'::text, 'ExclusiveLock'::text),
> ('ShareUpdateExclusiveLock'::text, 'AccessExclusiveLock'::text),
> ('ShareLock'::text, 'RowExclusiveLock'::text),
> ('ShareLock'::text, 'ShareUpdateExclusiveLock'::text),
> ('ShareLock'::text, 'ShareRowExclusiveLock'::text),
> ('ShareLock'::text, 'ExclusiveLock'::text),
> ('ShareLock'::text, 'AccessExclusiveLock'::text),
> ('ShareRowExclusiveLock'::text, 'RowExclusiveLock'::text),
> ('ShareRowExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),
> ('ShareRowExclusiveLock'::text, 'ShareLock'::text),
> ('ShareRowExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
> ('ShareRowExclusiveLock'::text, 'ExclusiveLock'::text),
> ('ShareRowExclusiveLock'::text, 'AccessExclusiveLock'::text),
> ('ExclusiveLock'::text, 'RowShareLock'::text),
> ('ExclusiveLock'::text, 'RowExclusiveLock'::text),
> ('ExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),
> ('ExclusiveLock'::text, 'ShareLock'::text),
> ('ExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
> ('ExclusiveLock'::text, 'ExclusiveLock'::text),
> ('ExclusiveLock'::text, 'AccessExclusiveLock'::text),
> ('AccessExclusiveLock'::text, 'AccessShareLock'::text),
> ('AccessExclusiveLock'::text, 'RowShareLock'::text),
> ('AccessExclusiveLock'::text, 'RowExclusiveLock'::text),
> ('AccessExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),
> ('AccessExclusiveLock'::text, 'ShareLock'::text),
> ('AccessExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
> ('AccessExclusiveLock'::text, 'ExclusiveLock'::text),
> ('AccessExclusiveLock'::text, 'AccessExclusiveLock'::text)
> ),
> l AS
> (
> SELECT
> (locktype,DATABASE,relation::regclass::text,page,tuple,virtualxid,transactionid,classid,objid,objsubid) AS target,
> virtualtransaction,
> pid,
> mode,
> granted
> FROM pg_catalog.pg_locks
> ),
> t AS
> (
> SELECT
> blocker.target AS blocker_target,
> blocker.pid AS blocker_pid,
> blocker.mode AS blocker_mode,
> blocked.target AS target,
> blocked.pid AS pid,
> blocked.mode AS mode
> FROM l blocker
> JOIN l blocked
> ON ( NOT blocked.granted
> AND blocker.granted
> AND blocked.pid != blocker.pid
> AND blocked.target IS NOT DISTINCT FROM blocker.target)
> JOIN c ON (c.requested = blocked.mode AND c.current = blocker.mode)
> ),
> r AS
> (
> SELECT
> blocker_target,
> blocker_pid,
> blocker_mode,
> '1'::int AS depth,
> target,
> pid,
> mode,
> blocker_pid::text || ',' || pid::text AS seq
> FROM t
> UNION ALL
> SELECT
> blocker.blocker_target,
> blocker.blocker_pid,
> blocker.blocker_mode,
> blocker.depth + 1,
> blocked.target,
> blocked.pid,
> blocked.mode,
> blocker.seq || ',' || blocked.pid::text
> FROM r blocker
> JOIN t blocked
> ON (blocked.blocker_pid = blocker.pid)
> WHERE blocker.depth < 1000
> )
> SELECT * FROM r
> ORDER BY seq;
> (0 rows)
>
>
> no finding here either :(
>
> best,
>
> /mstelios
>
>
> Stelios Mavromichalis
> Cytech Ltd. - http://www.cytech.gr/
> Science & Technology Park of Crete
> fax: +30 2810 31 1045
> tel.: +30 2810 31 4127
> mob.: +30 697 7078013
> skype: mstelios
>
>
> On Tue, May 6, 2014 at 2:54 AM, David G Johnston <
> david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>
>> Stelios Mavromichalis wrote
>> > as a prior step to dump/restore i am thinking of deleting and
>> > re-inserting that particular row. that might share some light you think?
>>
>> I still dislike the randomness of the unresponsiveness...
>>
>> Every time you perform an update you "delete and insert" that row - that is
>> how an update works in MVCC - so doing so explicitly is unlikely to provide
>> any benefit. Since that row is continually being inserted, and no other
>> rows are having this issue, I'm seriously doubting that a dump/restore is
>> going to have any effect either. Note that the index scan took twice as
>> long in the bad case - but still reasonable and you didn't notice any
>> benefit from a REINDEX. This is what I would expect.
>>
>> The only other difference, if concurrency has been ruled out, is the 4 vs
>> 18 buffers that had to be read. I cannot imagine, since all 22 were in
>> cache, that simply reading that much more data would account for the
>> difference (we're talking a 10,000-fold increase, not 2-to-4-fold). The
>> reason for this particular difference, IIUC, is how many candidate tuples
>> are present whose visibility has to be accounted for (assuming 1 buffer per
>> tuple, you needed to scan 4 vs 18 for visibility in the two queries).
>>
>> Is there any log file information you can share? Especially if you can set
>> log_min_duration_statement (or whatever that GUC is named) so that whenever
>> one of these gets adversely delayed it appears in the log along with
>> whatever other system messages are being sent. Checkpoints are a typical
>> culprit, though that should be affecting a great deal more than what you
>> indicate you are seeing.
>>
>> I'm pretty certain you are seeing this here largely because of the frequency
>> of activity on this particular user; not because the data itself is
>> corrupted. It could be some kind of symptom of internal concurrency that
>> you just haven't observed yet but it could also be I/O or other system
>> contention that you also haven't properly instrumented. Unfortunately that
>> is beyond my current help-providing skill-set.
>>
>> A dump-restore likely would not make anything worse, though I'd be surprised
>> if it were to improve matters. It also doesn't seem like hardware - unless
>> the RAM is bad. Software bugs are unlikely if this had been working well
>> before 5 days ago. So, you need to observe the system during both periods
>> (good and bad) and observe something that is different - probably not within
>> PostgreSQL if indeed you've minimized concurrency. And also see whether any
>> other queries, executed during both these times, exhibit a performance
>> decrease. Logging all statements would help matters greatly if you can
>> afford it in your production environment - it would make looking for
>> internal concurrency much easier.
>>
>> David J.
>>
>>
>>
>> --
>> View this message in context:
>> http://postgresql.1045698.n5.nabble.com/Re-recently-and-selectively-slow-but-very-simple-update-query-tp5802553p5802579.html
>> Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
>>
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org
>> )
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>
>
