Re: Momentary Delay

From: Anderson Valadares <andervalbh(at)gmail(dot)com>
To: Bill Moran <wmoran(at)potentialtech(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Momentary Delay
Date: 2015-06-17 13:33:37
Message-ID: CALaY=XkjYx8ftVx7AJwXOm3nHVR+-VwpNF1wdq20Q3G1-J7A0Q@mail.gmail.com
Lists: pgsql-general

2015-06-15 18:19 GMT-03:00 Bill Moran <wmoran(at)potentialtech(dot)com>:

> On Mon, 15 Jun 2015 17:48:54 -0300
> Anderson Valadares <andervalbh(at)gmail(dot)com> wrote:
>
> > 2015-06-12 19:56 GMT-03:00 Bill Moran <wmoran(at)potentialtech(dot)com>:
> >
> > > Please do not remove the mailing list from replies. See below.
> > >
> > > On Fri, 12 Jun 2015 09:21:19 -0300
> > > Anderson Valadares <andervalbh(at)gmail(dot)com> wrote:
> > >
> > > > 2015-06-08 20:33 GMT-03:00 Bill Moran <wmoran(at)potentialtech(dot)com>:
> > > >
> > > > > On Mon, 8 Jun 2015 11:59:31 -0300
> > > > > Anderson Valadares <andervalbh(at)gmail(dot)com> wrote:
> > > > >
> > > > > > Hi
> > > > > > We are experiencing an intermittent problem in a client's GIS
> > > > > > database. At certain times of day it is as if PostgreSQL executes
> > > > > > operations more slowly; see the example below. The query was run
> > > > > > three times: twice I canceled it, and the third time I let it
> > > > > > finish. The query took 10 seconds to complete, but the measured
> > > > > > execution time is 20 ms. Since we have several queries that run
> > > > > > every 2 seconds, when this momentary delay occurs the queries
> > > > > > accumulate and exceed the limit of 203 allowed connections. The
> > > > > > "momentary stops" last approximately 2 seconds and occur at
> > > > > > random times; during these stoppages no locks occur and no
> > > > > > increased load on the server is identified. A job runs every 2
> > > > > > seconds collecting information on locks, running queries, etc.,
> > > > > > and nmon also collects information every 2 seconds. My client
> > > > > > asks: what causes these momentary stops? Why do they affect all
> > > > > > operations of the database? How do I identify what is causing
> > > > > > these delays in the execution of operations in the database?
> > > > >
> > > > > A number of points to consider:
> > > > > * Check the contents of pg_locks and see if something is waiting
> > > > >   on a lock when the query is slow.
> > > >
> > > > There is nothing waiting when the query is slow, see:
> > > >
> > > >          snap_datetime         | waiting | count
> > > > -------------------------------+---------+-------
> > > >  2015-06-05 09:25:00.954731-03 | f       |    74
> > > >  2015-06-05 09:26:00.249187-03 | f       |   205
> > > >  2015-06-05 09:27:00.826874-03 | f       |   207
> > >
> > > I don't know what that means. Since you don't show the query that
> > > generated that output, I have no idea if your statement is valid, or
> > > if you're running a query that will inherently produce incorrect
> > > results.
> >
> > Sorry, the information was incomplete. Below is the query used to
> > extract information from the tables we use to monitor the database.
> > The snap_datetime column indicates when the monitoring snapshot was
> > taken, the waiting column indicates whether any connection was
> > waiting, and the count column shows how many connections existed at
> > the time.
> >
> > select s.snap_datetime,
> >        a.waiting,
> >        count(*)
> >   from stat_snapshot s, stat_activity a
> >  where s.snap_id = a.snap_id
> >    and s.snap_datetime >= '2015-06-05 09:25:00'
> >    and s.snap_datetime <= '2015-06-05 09:28:00'
> >  group by s.snap_datetime, a.waiting
> >  order by s.snap_datetime;
> >          snap_datetime         | waiting | count
> > -------------------------------+---------+-------
> >  2015-06-05 09:25:00.954731-03 | f       |    74
> >  2015-06-05 09:26:00.249187-03 | f       |   205
> >  2015-06-05 09:27:00.826874-03 | f       |   207
> > (3 rows)
>
> Given what I'm seeing above, there's still a lot of information
> missing. Those tables are not standard PostgreSQL tables, and I'm
> not aware of any tool that creates them, thus I have to assume
> it's a cron job or something similar created in-house. Given the
> query you ran and the resultant data, my first guess is that the
> data in stat_snapshot and stat_activity is worthless: i.e. it's
> captured once per minute, and therefore there are 59+ seconds
> worth of detail that isn't captured, thus the actual likelihood
> that those tables will contain any indication of the problem
> is very low.
>

Yes, it is an in-house cron job, running every hour, that gathers
information from the PostgreSQL statistics views (pg_stat_activity,
pg_locks, etc.). How can we improve the way this information is collected?
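
One option would be to sample much more often than once a minute. A
minimal sketch, as a single statement that cron or a shell loop could
run every few seconds; the stat_snapshot/stat_activity column names
here are assumptions based on the query above, and the
pg_stat_activity columns are the 9.2 ones:

-- One snapshot per call; run every 2-5 seconds so a 2-second stall
-- has a chance of being caught. Table/column names are assumed.
WITH snap AS (
    INSERT INTO stat_snapshot (snap_datetime)
    VALUES (clock_timestamp())
    RETURNING snap_id
)
INSERT INTO stat_activity (snap_id, pid, state, waiting, query_start, query)
SELECT snap.snap_id, a.pid, a.state, a.waiting, a.query_start, a.query
  FROM snap, pg_stat_activity a;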

>
> > > > > * Also, ensure log_lock_waits is turned on for the server and check
> > > > >   PostgreSQL's logs for messages about long lock waits.
> > > >
> > > > There is also nothing in the logs.
> > >
> > > This is even worse than the previous statement. What did you _do_ to
> > > determine this?
> > >
> > > Since you're unable to diagnose and fix the problem on your own, the
> > > possibility exists that the problem is with your diagnostic steps. If
> > > you don't share the details of those steps, I have no way to help you
> > > with them.
> >
> > See the log at
> > https://github.com/andervalbh/serv-gis01/blob/master/postgresql-2015-06-05_000000.log
>
> Apologies, my above comment was intended to be in regard to the I/O load.
>
> However, that log shows that you have massive lock contention. In the early
> part of the log a process waits over six minutes for a share lock:
> 2015-06-05 05:51:32 BRT [44236822] - [10.57.1.26] [14-1]
> db=citgis,user=opegis LOG: process 44236822 acquired ShareLock on
> transaction 3090377758 after 374146.817 ms
>
> Without seeing the source of the PL/pgSQL functions that seem to be
> causing the problem, I can only suggest that you review those functions
> for excessive locking and inefficient access. I'm going to guess, with
> the large number of AccessExclusiveLocks in the log, that the functions
> are doing explicit locking, and not doing so in an efficient manner.
>

That PL/pgSQL function has already been fixed, but note that during
the period in which the SQL showed the slowness problem (between
9:26:00 and 9:27:00) there are no lock occurrences.
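
Since one-minute snapshots can simply miss a 2-second stall, the next
time the slow window occurs we can also check for blocked sessions
live. Below is a sketch of the classic pg_locks self-join (written for
9.2, which predates pg_blocking_pids()), pairing each ungranted lock
with a conflicting granted one; with log_lock_waits = on, any wait
longer than deadlock_timeout (1s by default) would also show up in the
server log, as suggested earlier:

-- Pair each waiting session with the session(s) holding the lock it wants.
select w.pid   as waiting_pid,
       w.query as waiting_query,
       b.pid   as blocking_pid,
       b.query as blocking_query
  from pg_locks wl
  join pg_stat_activity w on w.pid = wl.pid
  join pg_locks bl
    on bl.locktype = wl.locktype
   and bl.database      is not distinct from wl.database
   and bl.relation      is not distinct from wl.relation
   and bl.page          is not distinct from wl.page
   and bl.tuple         is not distinct from wl.tuple
   and bl.virtualxid    is not distinct from wl.virtualxid
   and bl.transactionid is not distinct from wl.transactionid
   and bl.classid       is not distinct from wl.classid
   and bl.objid         is not distinct from wl.objid
   and bl.objsubid      is not distinct from wl.objsubid
   and bl.pid <> wl.pid
  join pg_stat_activity b on b.pid = bl.pid
 where not wl.granted
   and bl.granted;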

> > > > > * Based on the connection behavior you describe, I'm guessing it's
> > > > >   a Tomcat app using some sort of connection pool. Whatever it is,
> > > > >   tune your connection pool settings so that the max size of the
> > > > >   pool doesn't exceed the available PostgreSQL connections. At
> > > > >   least that will prevent errors from happening when the problem
> > > > >   occurs.
> > > >
> > > > Yes, it is a Tomcat application using connection pooling, and it
> > > > will be set not to exceed the limit of PostgreSQL connections, but
> > > > the central point is that the number of connections increased due
> > > > to what I am calling "momentary stoppages".
> > >
> > > I understand that. I made the suggestion as a temporary fix to reduce
> > > the impact until you are able to fix the actual cause.
> > >
> > > > > * The query you keep cancelling below: run just EXPLAIN on it (not
> > > > >   EXPLAIN ANALYZE), which will certainly finish and give you a plan
> > > > >   that can be reviewed to help determine what the problem is.
> > > >
> > > > The explain plan is at http://explain.depesz.com/s/bWw
> > >
> > > I'm confused. The query at that URL only took 10ms to complete, which
> > > doesn't seem at all unreasonable.
> >
> > I agree with you, and that is why I am asking what causes the delay
> > in execution at certain times, as demonstrated above.
> >
> > > > > * On the OS, monitor iostat or something else to see if you're
> > > > >   saturating disk capacity when the problem happens.
> > > >
> > > > The OS is also being monitored and no saturation occurs.
> > >
> > > Again, without any details, it's possible that there is a problem
> > > in your monitoring.
> > >
> > >
> > See the monitoring spreadsheet generated from nmon at
> > https://github.com/andervalbh/serv-gis01/blob/master/transfacil07_150605_0000.nmon.rar
> >
> >
> > >
> > > > > > The server is an IBM P720 with 128 GB RAM, running PostgreSQL
> > > > > > 9.2.9 on powerpc-ibm-aix7.1.0.0, compiled by gcc (GCC) 4.6.4,
> > > > > > 64-bit.
> > > > > >
> > > > > > Evolution of the number of connections over a period:
> > > > > >
> > > > > >          snap_datetime         | #connections
> > > > > > -------------------------------+--------------
> > > > > >  2015-06-05 09:25:00.954731-03 |           74
> > > > > >  2015-06-05 09:26:00.249187-03 |          205
> > > > > >  2015-06-05 09:27:00.826874-03 |          207
> > > > > >  2015-06-05 09:28:00.374666-03 |           73
> > > > > >  2015-06-05 09:29:00.690696-03 |           75
> > > > > >
> > > > > >
> > > > > > Occurrence of the problem
> > > > > >
> > > > > > citgis=# select max(dvi_sub.dvi_cod_dados_via_iti) as dvi_cod_dados_via_iti,
> > > > > >                 dvi_sub.via_cod_viagem
> > > > > >            from gis_dvi_dados_viagem_itinerario dvi_sub,
> > > > > >                 gis_via_viagem via_sub,
> > > > > >                 gis_iti_itinerario iti_sub
> > > > > >           where dvi_sub.via_cod_viagem = via_sub.via_cod_viagem
> > > > > >             and via_sub.via_status_viagem = 'A'
> > > > > >             and via_sub.via_dt_hora_ini > now() - interval '9 hours'
> > > > > >             and iti_sub.lin_cod_linha = 389
> > > > > >             and iti_sub.iti_sentido_itinerario = 'I'
> > > > > >             and iti_sub.iti_cod_itinerario_linha = via_sub.iti_cod_itinerario_linha
> > > > > >             and dvi_sub.vlo_cod_localizacao is not null
> > > > > >           group by dvi_sub.via_cod_viagem,
> > > > > >                    iti_sub.iti_cod_itinerario_linha,
> > > > > >                    via_sub.equ_cod_eqpto;
> > > > > > Cancel request sent
> > > > > > ERROR: canceling statement due to user request
> > > > > > Time: 2555.557 ms
> > > > > > citgis=# [same query as above]
> > > > > > Cancel request sent
> > > > > > ERROR: canceling statement due to user request
> > > > > > Time: 2192.572 ms
> > > > > > citgis=# explain analyze [same query as above]
> > > > > > Cancel request sent
> > > > > > ERROR: canceling statement due to user request
> > > > > > Time: 10164.015 ms
> > > > > > citgis=# select now();
> > > > > >               now
> > > > > > -------------------------------
> > > > > >  2015-06-05 09:27:22.006072-03
> > > > > > (1 row)
> > > > > >
> > > > > > Time: 0.152 ms
> > > > > > citgis=# [same query as above]
> > > > > >  dvi_cod_dados_via_iti | via_cod_viagem
> > > > > > -----------------------+----------------
> > > > > >             1059964443 |        7989813
> > > > > >             1060072723 |        7990876
> > > > > > (2 rows)
> > > > > >
> > > > > > Time: 5565.175 ms
> > > > > > citgis=# [same query as above]
> > > > > >  dvi_cod_dados_via_iti | via_cod_viagem
> > > > > > -----------------------+----------------
> > > > > >             1059964445 |        7989813
> > > > > >             1060072725 |        7990876
> > > > > > (2 rows)
> > > > > >
> > > > > > Time: 27.944 ms
> > > > > > citgis=# [same query as above]
> > > > > >  dvi_cod_dados_via_iti | via_cod_viagem
> > > > > > -----------------------+----------------
> > > > > >             1059964445 |        7989813
> > > > > >             1060072727 |        7990876
> > > > > > (2 rows)
> > > > > >
> > > > > > Time: 24.428 ms
> > > > > >
> > > > > > Greetings
> > > > > >
> > > > > > Anderson
> > > > >
> > > > >
> > > > > --
> > > > > Bill Moran <wmoran(at)potentialtech(dot)com>
> > > > >
> > >
> > >
> > > --
> > > Bill Moran <wmoran(at)potentialtech(dot)com>
> > >
>
>
> --
> Bill Moran <wmoran(at)potentialtech(dot)com>
>
