Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From: Lonni J Friedman <netllama(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Eduardo Morras <emorrasg(at)yahoo(dot)es>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: upgrade from 9.2.x to 9.3 causes significant performance degradation
Date: 2013-09-17 20:12:33
Message-ID: CAP=oouH+aBQ=DSNr9NLf6VP05AjRb+GNomDY5nQHSaGy5q1qWA@mail.gmail.com
Lists: pgsql-general

Thanks for your reply. Comments/answers inline below....

On Tue, Sep 17, 2013 at 11:28 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tue, Sep 17, 2013 at 11:22 AM, Lonni J Friedman <netllama(at)gmail(dot)com>
> wrote:
>>
>>
>> > c) What does logs say?
>>
>> The postgres server logs look perfectly normal, aside from noticeably
>> slower run times for most queries. There's nothing unusual in any of
>> the OS-level logs (/var/log/messages, etc.) or in dmesg.
>
>
> Are you generally CPU limited or IO limited?

top shows over 90% of the load in sys (kernel) space. The vmstat output
also suggests that it's CPU bound (or bouncing back and forth):
########
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
  r  b   swpd   free   buff     cache  si  so   bi    bo     in    cs us  sy id wa st
  1  0  17308 852016 141104 127074192   0   0   10    18      0     0  6   4 90  0  0
  0  0  17308 872316 141104 127074200   0   0    0   988    940   564  1   0 99  0  0
  0  0  17308 884288 141104 127074208   0   0    0  1921   1202  2132  1   0 99  0  0
  0  0  17308 898728 141104 127074208   0   0    0     0   1064   577  1   0 99  0  0
  2  0  17308 914920 141104 127074224   0   0    0    44    820   427  1   0 99  0  0
  0  0  17308 926524 141104 127074272   0   0    0    48   1173   585  1   0 99  0  0
108  1  17308 753648 141104 127074224   0   0    0   236   9825  3901 12   5 83  0  0
 50  0  17308 723156 141104 127074400   0   0    0   144  43481  9105 20  79  1  0  0
 45  0  17308 722860 141104 127074416   0   0    0     8  32969  1998  1  97  2  0  0
 47  0  17308 738996 141104 127074416   0   0    0     0  34099  1739  1  99  0  0  0
101  0  17308 770220 141104 127074480   0   0    0    32  38550  5998  7  93  0  0  0
101  0  17308 775732 141104 127074512   0   0    0   156  33889  5809  4  96  0  0  0
 99  0  17308 791232 141104 127074544   0   0    0     0  32385  4981  0 100  0  0  0
 96  0  17308 803156 141104 127074544   0   0    0    24  32413  4824  0 100  0  0  0
 87  0  17308 811624 141104 127074544   0   0    0     0  32438  4470  0 100  0  0  0
 83  0  17308 815500 141104 127074544   0   0    0     0  32489  4159  0 100  0  0  0
 80  0  17308 826572 141104 127074552   0   0    0    33  32582  3948  0 100  0  0  0
 73  0  17308 853264 141108 127074552   0   0    0    52  32833  3840  0 100  0  0  0
 73  0  17308 882240 141108 127074560   0   0    0     4  32820  3594  0 100  0  0  0
 72  0  17308 892256 141108 127074560   0   0    0     0  32368  3516  0 100  0  0  0
#######
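
(For reference, I didn't save the exact vmstat invocation for the sample
above, but it was a periodic sample along these lines:)
########
# 5-second samples; the first row is the average since boot
vmstat 5
########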

iostat consistently shows %util under 1.00, which also suggests that
disk I/O is not the bottleneck:
#########
iostat -dx /dev/sdb 5
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7)   09/17/2013   _x86_64_   (32 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.02     0.21    6.91   31.33   651.60  1121.85    46.38     0.09    2.25   0.08   0.31

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    8.00     0.00    93.00    11.62     0.00    0.28   0.20   0.16

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00   11.00     0.00   125.40    11.40     0.00    0.16   0.16   0.18

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00  105.00     0.00  3380.40    32.19     0.29    2.76   0.03   0.34

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00   14.80     0.00  2430.60   164.23     0.00    0.12   0.09   0.14

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     1.20    0.00   41.60     0.00  1819.40    43.74     0.02    0.45   0.05   0.20

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    2.80     0.00    32.00    11.43     0.00    0.00   0.00   0.00
#########

mpstat also shows virtually zero iowait, with a ton of sys (CPU) time:
#########
mpstat 2 10
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7)   09/17/2013   _x86_64_   (32 CPU)

12:53:19 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
12:53:21 PM  all    7.36    0.00   92.58    0.00    0.00    0.03    0.00    0.00    0.03
12:53:23 PM  all    6.35    0.00   90.43    0.00    0.00    0.03    0.00    0.00    3.19
12:53:25 PM  all    3.13    0.00   68.20    0.00    0.00    0.02    0.00    0.00   28.66
12:53:27 PM  all    6.07    0.00   68.46    0.00    0.00    0.03    0.00    0.00   25.44
12:53:29 PM  all    5.83    0.00   94.14    0.00    0.00    0.03    0.00    0.00    0.00
12:53:31 PM  all    5.75    0.00   94.14    0.00    0.00    0.11    0.00    0.00    0.00
12:53:33 PM  all    7.65    0.00   40.32    0.00    0.00    0.03    0.00    0.00   52.00
12:53:35 PM  all    3.78    0.00    0.94    0.00    0.00    0.03    0.00    0.00   95.25
12:53:37 PM  all   16.09    0.00   14.33    0.02    0.00    0.05    0.00    0.00   69.51
12:53:39 PM  all    5.60    0.00   10.38    0.00    0.00    0.05    0.00    0.00   83.97
Average:     all    6.76    0.00   57.41    0.00    0.00    0.04    0.00    0.00   35.79
#########

So unless I've misunderstood the output above, all of this suggests
that the CPU is the bottleneck. If you'd like some other data, let me
know and I'll be happy to capture it.
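
If it would help, I could also grab a per-process breakdown of that sys
time with something like the following (pidstat ships in the same
sysstat package as the iostat/mpstat output above; the interval/count
and the top invocation are just examples):
########
# per-process CPU usage, sampled every 2 seconds, 10 times
pidstat -u 2 10

# or a one-shot snapshot of the postgres backends sorted by CPU
top -b -n 1 -u postgres | head -40
########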

> Grab one of those slower queries and do an "explain (analyze, buffers)" of
> it. Preferably with track_io_timing on.

To be clear, nearly every query is slow, not just certain ones. The
entire server is slow, and the only thing that has changed is the
version of postgres (granted, not a trivial change). Anyway, I picked
a query at random, connected with psql, and ran the following:
########
nightly=# SET track_io_timing=on ;
SET

nightly=# EXPLAIN (analyze,buffers) SELECT count(*) FROM pg_catalog.pg_class AS c WHERE c.oid = pgpool_regclass('dblcufft_testlist') AND c.relpersistence = 'u';
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2779.71..2779.72 rows=1 width=0) (actual time=83.136..83.136 rows=1 loops=1)
   Buffers: shared hit=1633
   ->  Seq Scan on pg_class c  (cost=0.00..2779.70 rows=1 width=0) (actual time=83.131..83.131 rows=0 loops=1)
         Filter: ((relpersistence = 'u'::"char") AND (oid = pgpool_regclass('dblcufft_testlist'::cstring)))
         Rows Removed by Filter: 70393
         Buffers: shared hit=1633
 Total runtime: 83.181 ms
(7 rows)
########

I'm not quite sure how to interpret this output. I didn't have
track_io_timing enabled before now. Does it require a restart to
enable? The official docs seem to suggest that there should be a
number of additional fields, which I'm not seeing.
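
For what it's worth, this is roughly how I'd confirm the setting took
effect in that session (I believe track_io_timing is superuser-settable,
so this assumes a superuser connection):
########
nightly=# SHOW track_io_timing;   -- should report 'on' for this session after the SET above
########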

>
> Do you still have test/dev/QA/etc copy running on 9.2 for comparison?

Not live. I'd have to restore from a backup, and my test environment
is not the same class of hardware as production. I could get that
process underway if you wanted. What did you have in mind?
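
(If that's useful, the rough process on my end would be something like
the following, assuming I restore a recent custom-format pg_dump of the
nightly database onto the old 9.2 binaries; the paths, port, and file
names below are placeholders:)
########
# initialize a throwaway 9.2 cluster on a non-default port
/usr/pgsql-9.2/bin/initdb -D /data/pg92_test
/usr/pgsql-9.2/bin/pg_ctl -D /data/pg92_test -o "-p 5433" -l /tmp/pg92_test.log start

# restore the most recent dump into it
createdb -p 5433 nightly
pg_restore -p 5433 -d nightly -j 4 /backups/nightly.dump
########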
