Re: PG performance issues related to storage I/O waits

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: PG performance issues related to storage I/O waits
Date: 2013-08-05 20:28:37
Message-ID: 52000AF5.2040100@fuzzy.cz
Lists: pgsql-performance

Hi,

On 5.8.2013 17:55, Tasos Petalas wrote:
>
> Seems most of the I/O is caused by SELECT backend processes (READ),
> whereas (WRITE) requests of the wal writer and checkpointer processes
> do not appear as top I/O processes (correct me if I am wrong)
>
> E.g. check the following heavy write process that reports 0% I/O ...!
>
> 14:09:40 769 be/4 enterpri 0.00 B/s 33.65 M/s 0.00 % 0.00 %
> postgres: wal writer process

That's because the WAL writer does sequential I/O (writes), which is a
perfect match for SAS drives.

OTOH the queries do a lot of random reads, which is a terrible match for
spinners.
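Rough numbers to illustrate (ballpark figures, not measured on your
box): a single 15k SAS drive does maybe 150-200 random IOPS, so 16 of
them in RAID-10 top out somewhere around ~3000 random read IOPS in
aggregate, yet the same array can stream sequential writes at hundreds
of MB/s. That's why the WAL writer pushing ~34 MB/s sequentially barely
registers in iotop, while the random-reading SELECTs saturate the array.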

> That however still doesn't say which processes are responsible
> for that. Is that the background writer, backends running queries,
> or what? The iotop should give you an answer to this (or at least
> a hint).
>
>
> It seems most of the I/O is reported by backends running heavy
> concurrent SELECT queries (see iotop attachment in previous email)

Yes, that seems to be the case.
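If you want to tie the I/O to particular queries, you can match the
PIDs from iotop against pg_stat_activity. A rough sketch (assumes
PostgreSQL 9.2+ - on older releases the columns are procpid and
current_query - and "yourdb" is just a placeholder):

    psql -d yourdb -c "
        SELECT pid, state, now() - query_start AS runtime,
               left(query, 60) AS query_snippet
        FROM pg_stat_activity
        WHERE state <> 'idle'
        ORDER BY runtime DESC NULLS LAST;"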

> Also, how are these volumes defined? Do they use distinct sets
> of disks?
> How many disks are used for each volume?
>
>
> These are LUNs from the SAN (we have 16 dedicated 2.5" SAS disks in
> RAID-10 topology in the storage array)

I do understand these are LUNs from the SAN. I was asking whether there
are separate sets of disks for the data directory (which you mentioned
to be RAID-10) and pg_archives (which you mentioned to be RAID-5).

Although I doubt it'd be possible to use the same disk for two LUNs.

> > Yes we are using 15K SAS disks in RAID 10. (253-2 dev refers to sar
> > output for disks)
>
> OK, so the pg_archives is probably for xlog archive, right?
>
> NO.
> /pg_archives is the target mount point where we copy archive logs to
> (archive_command = 'test ! -f /pg_archives/%f && cp %p /pg_archives/%f')

... which is exactly what a WAL archive is. That's why the GUC is
called archive_command.
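If you want to verify the archiver keeps up during the peaks, counting
the .ready flags is a cheap check - each one is a segment still waiting
to be copied. A sketch (assumes $PGDATA points at your data directory;
pg_xlog is the 9.x layout):

    ls "$PGDATA"/pg_xlog/archive_status/ | grep -c '\.ready$'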

> I've checked the conf, and I think you should really consider
> increasing checkpoint_segments - it's set to 3 (= 48MB) but I
> think something like 32 (= 512MB) or even more would be more
> appropriate.
>
> We use EDB dynatune. The actual settings can be found in the file
> (Ticket.Usual.Info.27.07.13.txt) attached to the initial e-mail -->
> check the "show all;" section.
> Current checkpoint_segments is set to 64

OK, I'm not familiar with dynatune, and I got confused by the
postgresql.conf that you sent. 64 seems fine to me.
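With dynatune rewriting things behind the scenes, it's safer to ask the
running server than to read postgresql.conf. For example:

    psql -c "SHOW checkpoint_segments;"
    psql -c "SHOW checkpoint_timeout;"
    psql -c "SHOW shared_buffers;"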

> I see you've enabled log_checkpoints - can you check your logs how
> often the checkpoints happen?
>
>
> This is the output of the checkpoints during peak hours (avg. every 2-5
> minutes)
>
> 2013-08-02 14:00:20 UTC [767]: [19752]: [0]LOG: checkpoint complete:
> wrote 55926 buffers (5.3%); 0 transaction log file(s) added, 0 removed,
> 41 recycled; write=220.619 s, sync=5.443 s, total=226.152 s;
> sync files=220, longest=1.433 s, average=0.024 s
> 2013-08-02 14:05:14 UTC [767]: [19754]: [0]LOG: checkpoint complete:
> wrote 109628 buffers (10.5%); 0 transaction log file(s) added,
> 0 removed, 31 recycled; write=209.714 s, sync=9.513 s, total=219.252 s;
> sync files=222, longest=3.472 s, average=0.042 s

Meh, seems OK to me. My earlier concern was based on the incorrect
checkpoint_segments value (3) in the postgresql.conf you sent ...
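For the record, this is roughly how I'd eyeball the checkpoint spacing
straight from the log (the path is a placeholder - use wherever your
logs actually land):

    grep 'checkpoint complete' /var/log/postgresql/postgresql.log \
        | awk '{print $1, $2}'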
>
> Also, can you check pg_stat_bgwriter view? I'd bet the value in
> checkpoints_timed is very low, compared to checkpoints_req. Or even
> better, get the values from this view before / after running the
> batch jobs.
>
> Results during load:
> checkpoints_timed = 12432, checkpoints_req = 3058

Again, seems fine.
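For completeness, the before/after snapshot I had in mind is just this,
run once before the batch jobs and once after (columns as of 9.2):

    psql -c "SELECT checkpoints_timed, checkpoints_req,
                    buffers_checkpoint, buffers_backend
             FROM pg_stat_bgwriter;"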

> In the afternoon it's a different story - for 253-2 it looks
> like this:
>
> DEV tps rd_sec/s wr_sec/s await %util
> 15:50:01 dev253-2 4742.91 33828.98 29156.17 84.84 105.14
> 16:00:01 dev253-2 2781.05 12737.41 18878.52 19.24 80.53
> 16:10:01 dev253-2 3661.51 20950.64 23758.96 36.86 99.03
> 16:20:01 dev253-2 5011.45 32454.33 31895.05 72.75 102.38
> 16:30:01 dev253-2 2638.08 14661.23 17853.16 25.24 75.64
> 16:40:01 dev253-2 1988.95 5764.73 14190.12 45.05 58.80
> 16:50:01 dev253-2 2185.15 88296.73 11806.38 7.46 84.37
> 17:00:01 dev253-2 2031.19 12835.56 12997.34 8.90 82.62
> 17:10:01 dev253-2 4009.24 34288.71 23974.92 38.07 103.01
> 17:20:01 dev253-2 3605.86 26107.83 22457.41 45.76 90.90
> 17:30:01 dev253-2 2550.47 7496.85 18267.07 19.10 65.87
>
>
> This is when the actual problem arises

Well, then I think it's mostly about the SELECT queries.
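Side note: major number 253 is device-mapper, so if you ever need to
double-check which volume dev253-2 actually is, something like this
maps it back to a name (lsblk needs a reasonably recent util-linux):

    ls -l /dev/mapper/                  # dm names with major:minor
    lsblk -o NAME,MAJ:MIN,MOUNTPOINT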

> What I think you could/should do:
>
> * move pg_xlog to a separate device (not simply a volume on the SAN,
> sharing disks with the other volumes - that won't give you
> anything)
>
> Unfortunately we cannot do so at the moment (all available SAN
> resources are assigned to the pg_data directory of the server)
>
> I'd expect these changes to improve the afternoon peak, as it's
> doing about 50% writes. However I would not expect this to improve
> the morning peak, because that's doing a lot of reads (not writes).
>
> Afternoon peak is what we need to troubleshoot (will check if we can
> assign pg_xlog to a different LUN - not an option currently)

OK, understood. It's difficult to predict the gain, and given the iotop
output it might even cause harm.
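For future reference, if a dedicated LUN does become available, the
usual procedure is a symlink swap with the cluster stopped - a sketch
only, /new_lun being a placeholder:

    pg_ctl -D "$PGDATA" stop
    mv "$PGDATA"/pg_xlog /new_lun/pg_xlog
    ln -s /new_lun/pg_xlog "$PGDATA"/pg_xlog
    pg_ctl -D "$PGDATA" start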

>
> Will SSD improve write performance? We are thinking of moving in
> this direction.

It'll certainly improve the random I/O in general, which is the main
issue with SELECT queries. Sequential read/write improvement probably
won't be that significant.
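If you get an SSD to evaluate, measure its random read behaviour
directly instead of trusting the spec sheet. A hedged fio sketch (the
paths are placeholders; bs=8k matches the PostgreSQL page size):

    fio --name=randread --filename=/mnt/candidate/fio.test --size=4G \
        --rw=randread --bs=8k --direct=1 --numjobs=4 \
        --runtime=60 --time_based --group_reporting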

Tomas
