Quick Links

Re: Cannot reproduce why a query is slow

From:	John Cheng <johnlicheng(at)gmail(dot)com>
To:	Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Cannot reproduce why a query is slow
Date:	2011-07-27 19:56:09
Message-ID:	CAJzZBAQn+7NU9WYv1pFJSSBxnJhkb934a0O45KUoxAbDhuh4Tw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Sorry about the long wait between reply.

On Sat, May 7, 2011 at 2:36 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> Resource usually means there's too much I/O so the query is slow, but
> when you try it later the drives are idle and query runs much faster.
> Run some monitoring, e.g. even a simple 'iostat -x' or 'dstat' output
> might be sufficient. Once the slow query happens, try to correlate it to
> the CPU / disk activity.
>

I am unsure if disk IO could be a part of the problem. iostat shows a max of
~2000kb/second write. This query gets during a batch of insert/update
process (every 5 minutes or so, we batch process a list of 20 or so xmls,
i.e., "leads"). During the batch processing, iostat shows pretty low usage.

avg-cpu: %user %nice %system %iowait %steal %idle
4.38 0.00 0.12 0.62 0.00 94.88

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz
avgqu-sz await svctm %util
sda 0.00 0.99 0.00 5.94 0.00 55.45 9.33
0.10 16.17 6.17 3.66
sdb 0.00 172.28 0.00 11.88 0.00 1473.27 124.00
0.03 2.42 1.08 1.29
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
8.35 0.00 0.37 0.37 0.00 90.90

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz
avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
sdb 0.00 621.00 0.00 23.00 0.00 5152.00 224.00 0.08
3.52 1.22 2.80
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
9.62 0.00 0.12 1.88 0.00 88.38

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz
avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
sdb 0.00 73.00 0.00 9.00 0.00 656.00 72.89
0.01 1.33 1.00 0.90
sdc 0.00 200.00 20.00 129.00 304.00 2632.00 19.70
1.22 8.16 0.68 10.20
sdd 0.00 0.00 9.00 0.00 120.00 0.00 13.33
0.04 4.67 4.56 4.10

The 'atop' command shows low disk activity ~2%, but I'm not sure how
accurate atop is. If I were to run a similar query from the command line
while the batch job is running, I still get a 50ms execution time.

CPU usage does goes up a lot during this time. There are 8 cores and 1 of
the cores shoots up to ~97%. The process updates XML records in the database
and performs dozens of xpath based queries to pull data out of the xml into
relational tables. These extractions are performed with pgplsql triggers and
Postgres xpath functions. Hence the spike in CPU usage during that time.

>
> The unexpected plan change is usually caused by autovacuum/autoanalyze
> collecting skewed data for some reason, resulting in bad plan choice.
> Then the autovacuum runs again and you get different (much better) plan.
> This can be detected using the auto_explain contrib module, as someone
> already recommended.
>

We are using Postgres 8.3 and I don't believe the autoexplain contrib module
is available for 8.3.

>
> > The query is a bit complex, as it is generated by code, but I tried to
> > format it for easier reading. I've also replaced actual data with fake
> > data to protected personal information.
>
> I generally do recommend using explain.depesz.com to post explain plans,
> especially in case of complex queries. I've posted your query and this
> is the result
>
> http://explain.depesz.com/s/gJO
>
> Not sure if it's relevant to your issue (probably not), but the bitmap
> index scans are significantly overestimated. Not sure if the overlap
> operator affects the estimate accuracy ...
>
> BTW what postgresql version is this? How large the database is, how much
> RAM is available? What is the size of shared_buffers?
>

I am loving explain.depesz.com!

We are using Postgres 8.3.This system has 16GB of RAM. The database is
114GB. shared_buffers is set to 2000MB. A possibly related note... the
'effective_cache_size' setting. The previous admin set it to 8000MB. The
free command shows 13.8GB cache, 397MB buffer. I wonder if changing the
effective_cache_size can improve query planning.

In response to

Re: Cannot reproduce why a query is slow at 2011-05-07 21:36:51 from Tomas Vondra

Browse pgsql-general by date

	From	Date	Subject
Next Message	Peter V	2011-07-27 20:18:06	WITH x AS (...) and visibility in UPDATE
Previous Message	Rob Sargent	2011-07-27 18:46:56	Re: error when compiling a c function