Re: PG performance issues related to storage I/O waits

From: Tasos Petalas <tasos(dot)petalas(at)upstreamsystems(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>, Tasos Petalas <tasos(dot)petalas(at)upstreamsystems(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: PG performance issues related to storage I/O waits
Date: 2013-08-06 04:46:58
Message-ID: CAJtGb8rCU1jgsHF0yJyj+_904YWfZvpt+kSc71Qd6VgWyQTwNQ@mail.gmail.com
Lists: pgsql-performance

On Mon, Aug 5, 2013 at 11:28 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> Hi,
>
> On 5.8.2013 17:55, Tasos Petalas wrote:
> >
> > Seems most of the I/O is caused by SELECT backend processes (READ),
> > whereas (WRITE) requests of the wal writer and checkpointer processes do
> > not appear as top I/O processes (correct me if I am wrong)
> >
> > E.g. check the following heavy write process that reports 0% I/O ...!
> >
> > 14:09:40 769 be/4 enterpri 0.00 B/s 33.65 M/s 0.00 % 0.00 % postgres: wal writer process
>
> That's because the WAL writer does sequential I/O (writes), which is a
> perfect match for SAS drives.
>
> OTOH the queries do a lot of random reads, which is a terrible match for
> spinners.
>
> > That however still doesn't say which processes are responsible
> > for that.
> > Is that background writer, backends running queries or what? The
> > iotop
> > should give you answer to this (or at least a hint).
> >
> >
> > It seems most of the I/O is reported from backends running heavy
> > concurrent select queries (see iotop attachment in previous email)
>
> Yes, that seems to be the case.
>
> > Also, how are these volumes defined? Do they use distinct sets
> > of disks?
> > How many disks are used for each volume?
> >
> >
> > These are LUNs from the SAN (we have 16 dedicated 2.5" SAS disks in a
> > RAID-10 topology on the storage array)
>
> I do understand these are LUNs from the SAN. I was asking whether there
> are separate sets of disks for the data directory (which you mentioned
> to be RAID-10) and pg_archives (which you mentioned to be RAID-5).
>
> Although I doubt it'd be possible to use the same disk for two LUNs.
>

Sorry, I didn't get your question right. Yes, there are different disk sets
for RAID-10 (data) and RAID-5 (WAL archives).

>
> > > Yes we are using 15K SAS disks in RAID 10. (253-2 dev refers to sar
> > > output for disks)
> >
> > OK, so the pg_archives is probably for xlog archive, right?
> >
> > NO.
> > /pg_archives is the target mount point where we copy the archived WAL
> > segments to
> > (archive_command = 'test ! -f /pg_archives/%f && cp %p /pg_archives/%f')
>
> ... which is exactly what WAL archive is. That's why the GUC is called
> archive_command.
>

Again I misunderstood your question. I wrongly thought you were asking about
a separate LUN for pg_xlog (i.e. pg_xlog on a separate device), not about
the WAL archives.
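
For completeness, this is easy to confirm from a live session; a quick
sketch, using nothing beyond the standard pg_settings view:

    -- Confirm the archiving setup from psql; archive_mode and
    -- archive_command are the GUCs quoted above.
    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('archive_mode', 'archive_command');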

>
> > I've checked the conf, and I think you should really consider increasing
> > checkpoint_segments - it's set to 3 (= 64MB) but I think something like
> > 32 (= 512MB) or even more would be more appropriate.
> >
> > We use EDB dynatune. The actual settings can be found in the file
> > (Ticket.Usual.Info.27.07.13.txt) of the initial e-mail --> check the
> > "show all;" section. The current checkpoint_segments is set to 64.
>
> OK, I'm not familiar with dynatune, and I got confused by the
> postgresql.conf that you sent. 64 seems fine to me.
>

Understood. EDB dynatune is a feature that ships with the EDB Postgres
builds and is supposed to take care of most of the PG configuration
parameters (found in postgresql.conf) automatically, adjusting them at run
time (you can always override them).

The "show all" command at the psql prompt gives you the actual values at
any given time.
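
For reference, a sketch of how to check what dynatune has actually applied
(the parameter list below is only an example):

    -- The source column shows where each effective value came from
    -- (default, configuration file, etc.), which makes run-time
    -- adjustments easy to spot.
    SELECT name, setting, unit, source
    FROM pg_settings
    WHERE name IN ('checkpoint_segments',
                   'checkpoint_timeout',
                   'shared_buffers',
                   'effective_cache_size');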

> > I see you've enabled log_checkpoints - can you check your logs how often
> > the checkpoints happen?
> >
> >
> > This is the checkpoint log output during peak hours (a checkpoint every
> > 2-5 minutes on average)
> >
> > 2013-08-02 14:00:20 UTC [767]: [19752]: [0]LOG: checkpoint complete: wrote 55926 buffers (5.3%); 0 transaction log file(s) added, 0 removed, 41 recycled; write=220.619 s, sync=5.443 s, total=226.152 s; sync files=220, longest=1.433 s, average=0.024 s
> > 2013-08-02 14:05:14 UTC [767]: [19754]: [0]LOG: checkpoint complete: wrote 109628 buffers (10.5%); 0 transaction log file(s) added, 0 removed, 31 recycled; write=209.714 s, sync=9.513 s, total=219.252 s; sync files=222, longest=3.472 s, average=0.042 s
>
> Meh, seems OK to me. This was based on the incorrect number of
> checkpoint segments ...
> >
> > Also, can you check pg_stat_bgwriter view? I'd bet the value in
> > checkpoints_timed is very low, compared to checkpoints_req. Or even
> > better, get the values from this view before / after running the
> > batch jobs.
> >
> > Results during load:
> > checkpoints_timed = 12432, checkpoints_req = 3058
>
> Again, seems fine.
>
>
Updated values for pg_stat_bgwriter after the batch activity (off-peak):
checkpoints_timed = 12580, checkpoints_req = 3070

I don't see any significant difference here.
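
For anyone repeating this, a minimal way to capture the before/after
comparison is to snapshot the view around the batch window, e.g.:

    -- Run once before and once after the batch jobs and compare.
    -- checkpoints_req growing much faster than checkpoints_timed
    -- would point at checkpoint_segments being too low.
    SELECT now() AS sampled_at,
           checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,
           buffers_backend
    FROM pg_stat_bgwriter;

The deltas here (148 timed vs. 12 requested over the batch window) suggest
checkpoints are almost entirely time-driven, which is the healthy case.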

> > In the afternoon it's a different story - for 253-2 it looks like this:
> >
> >              DEV          tps   rd_sec/s   wr_sec/s   await    %util
> > 15:50:01  dev253-2    4742.91   33828.98   29156.17   84.84   105.14
> > 16:00:01  dev253-2    2781.05   12737.41   18878.52   19.24    80.53
> > 16:10:01  dev253-2    3661.51   20950.64   23758.96   36.86    99.03
> > 16:20:01  dev253-2    5011.45   32454.33   31895.05   72.75   102.38
> > 16:30:01  dev253-2    2638.08   14661.23   17853.16   25.24    75.64
> > 16:40:01  dev253-2    1988.95    5764.73   14190.12   45.05    58.80
> > 16:50:01  dev253-2    2185.15   88296.73   11806.38    7.46    84.37
> > 17:00:01  dev253-2    2031.19   12835.56   12997.34    8.90    82.62
> > 17:10:01  dev253-2    4009.24   34288.71   23974.92   38.07   103.01
> > 17:20:01  dev253-2    3605.86   26107.83   22457.41   45.76    90.90
> > 17:30:01  dev253-2    2550.47    7496.85   18267.07   19.10    65.87
> >
> >
> > This is when the actual problem arises
>
> Well, then I think it's mostly about the SELECT queries.
>
> > What I think you could/should do:
> >
> > * move pg_xlog to a separate device (not simply a volume on the SAN,
> >   sharing disks with the other volumes - that won't give you anything)
> >
> > Unfortunately we cannot do so at the moment (all available SAN
> > resources are assigned to the pg_data directory of the server)
> >
> > I'd expect these changes to improve the afternoon peak, as it's doing
> > about 50% writes. However I would not expect this to improve the morning
> > peak, because that's doing a lot of reads (not writes).
> >
> > The afternoon peak is what we need to troubleshoot (we will check
> > whether we can assign pg_xlog to a different LUN - not an option
> > currently)
>
> OK, understood. It's difficult to predict the gain and given the iotop
> output it might even cause harm.
>
> >
> > Will SSDs improve write performance? We are thinking of moving in this
> > direction.
>
> It'll certainly improve the random I/O in general, which is the main
> issue with SELECT queries. Sequential read/write improvement probably
> won't be that significant.
>
> Tomas
>
