Re: Very poor read performance, query independent

From: Charles Nadeau <charles(dot)nadeau(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Very poor read performance, query independent
Date: 2017-07-12 10:04:49
Message-ID: CADFyZw4RTBtWUeo-6Q_GcsqawkF+gSe4JGEpASUE6msiu=01xA@mail.gmail.com
Lists: pgsql-performance

Jeff,

Here are the two EXPLAINs for one of my simplest queries:

flows=# SET track_io_timing = on;
LOG: duration: 24.101 ms statement: SET track_io_timing = on;
SET
flows=# explain (analyze, timing off) SELECT DISTINCT
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport,
flows-# COUNT(*) AS conversation,
flows-# SUM(doctets) / 1024 / 1024 AS mbytes
flows-# FROM
flows-# flowscompact,
flows-# mynetworks
flows-# WHERE
flows-# mynetworks.ipaddr >>= flowscompact.srcaddr
flows-# AND dstaddr IN
flows-# (
flows(# SELECT
flows(# dstaddr
flows(# FROM
flows(# dstexterne
flows(# )
flows-# GROUP BY
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport
flows-# ORDER BY
flows-# mbytes DESC LIMIT 50;
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.3", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.4", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.5", size 639696896
LOG: duration: 2632108.352 ms statement: explain (analyze, timing off)
SELECT DISTINCT
srcaddr,
dstaddr,
dstport,
COUNT(*) AS conversation,
SUM(doctets) / 1024 / 1024 AS mbytes
FROM
flowscompact,
mynetworks
WHERE
mynetworks.ipaddr >>= flowscompact.srcaddr
AND dstaddr IN
(
SELECT
dstaddr
FROM
dstexterne
)
GROUP BY
srcaddr,
dstaddr,
dstport
ORDER BY
mbytes DESC LIMIT 50;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual rows=50 loops=1)
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual rows=50 loops=1)
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual rows=45499045 loops=10)
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual rows=54155970 loops=10)
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 48679396
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual rows=2 loops=5415597040)
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual rows=4 loops=10)
 Planning time: 62.066 ms
 Execution time: 2631923.716 ms
(33 rows)

flows=# explain (analyze, buffers) SELECT DISTINCT
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport,
flows-# COUNT(*) AS conversation,
flows-# SUM(doctets) / 1024 / 1024 AS mbytes
flows-# FROM
flows-# flowscompact,
flows-# mynetworks
flows-# WHERE
flows-# mynetworks.ipaddr >>= flowscompact.srcaddr
flows-# AND dstaddr IN
flows-# (
flows(# SELECT
flows(# dstaddr
flows(# FROM
flows(# dstexterne
flows(# )
flows-# GROUP BY
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport
flows-# ORDER BY
flows-# mbytes DESC LIMIT 50;
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896
LOG: duration: 2765020.327 ms statement: explain (analyze, buffers)
SELECT DISTINCT
srcaddr,
dstaddr,
dstport,
COUNT(*) AS conversation,
SUM(doctets) / 1024 / 1024 AS mbytes
FROM
flowscompact,
mynetworks
WHERE
mynetworks.ipaddr >>= flowscompact.srcaddr
AND dstaddr IN
(
SELECT
dstaddr
FROM
dstexterne
)
GROUP BY
srcaddr,
dstaddr,
dstport
ORDER BY
mbytes DESC LIMIT 50;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual time=2764548.863..2764548.891 rows=50 loops=1)
   Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
   I/O Timings: read=5323746.860
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual time=2764548.861..2764548.882 rows=50 loops=1)
         Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
         I/O Timings: read=5323746.860
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual time=2764548.859..2764548.872 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
               I/O Timings: read=5323746.860
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                     I/O Timings: read=5323746.860
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                           I/O Timings: read=5323746.860
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 Buffers: shared hit=1116590559 read=15851133
                                 I/O Timings: read=5323746.860
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual time=1847579.360..2602039.780 rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=1116588309 read=15851133
                                       I/O Timings: read=5323746.860
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual time=1.562..736556.583 rows=45499045 loops=10)
                                             Buffers: shared hit=996551813 read=15851133
                                             I/O Timings: read=5323746.860
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual time=1.506..547485.066 rows=54155970 loops=10)
                                                   Buffers: shared hit=1634 read=15851133
                                                   I/O Timings: read=5323746.860
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 59971474
                                                   Buffers: shared hit=996550152
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual time=1847228.894..1847228.894 rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             Buffers: shared hit=120036496
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual time=1230049.015..1845955.764 rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=120036496
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual time=0.084..831832.333 rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         Buffers: shared hit=120036496
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual time=0.027..113052.437 rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                               Buffers: shared hit=120036459
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=5415597040)
                                                               Buffers: shared hit=10
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.007..0.008 rows=4 loops=10)
                                                                     Buffers: shared hit=10
 Planning time: 6.689 ms
 Execution time: 2764860.853 ms
(58 rows)

Regarding "Also using dstat I can see that iowait time is at about 25%", I
don't think the server was doing anything else. If it is important, I can
repeat the benchmarks.
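For reference, the iowait figure comes from watching the "wai" column of
dstat's CPU stats while the query runs; the invocation is roughly the
following (a sketch from memory, the exact flags may have differed):

#CPU and disk statistics refreshed every 60 seconds; iowait is the "wai" column
dstat -cd 60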
Thanks!

Charles

On Wed, Jul 12, 2017 at 2:39 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Tue, Jul 11, 2017 at 4:02 AM, Charles Nadeau <charles(dot)nadeau(at)gmail(dot)com>
> wrote:
>
>> Jeff,
>>
>> I used fio in a quick benchmarking script inspired by
>> https://smcleod.net/benchmarking-io/:
>>
>> #!/bin/bash
>> #Random throughput
>> echo "Random throughput"
>> sync
>> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=test --filename=test --bs=4M --iodepth=256 --size=10G
>> --readwrite=randread --ramp_time=4
>> #Random IOPS
>> echo "Random IOPS"
>> sync
>> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=test --filename=test --bs=4k --iodepth=256 --size=4G
>> --readwrite=randread --ramp_time=4
>> #Sequential throughput
>> echo "Sequential throughput"
>> sync
>> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=test --filename=test --bs=4M --iodepth=256 --size=10G
>> --readwrite=read --ramp_time=4
>> #Sequential IOPS
>> echo "Sequential IOPS"
>> sync
>> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=test --filename=test --bs=4k --iodepth=256 --size=4G
>> --readwrite=read --ramp_time=4
>>
>>
> I don't think any of those are directly relevant to PostgreSQL, as it
> doesn't use direct IO, doesn't use libaio, and is rarely going to get
> anywhere near 256 iodepth. So the best they can do is put a theoretical
> ceiling on the performance. Also, random IO with a 4MB stride doesn't make
> any sense from a PostgreSQL perspective.
>
>
>
>>
>> Performing the test you suggested, I get 128.5MB/s. Monitoring the test,
>> I find that the throughput is constant from start to finish and that the
>> iowait is also constant at 5%:
>>
>
> I would have expected it to do better than that. Maybe you could increase
> the kernel readahead setting. I've found the default to be much too small.
> But it doesn't make much difference to you, as you appear to be doing
> random IO in your queries, not sequential.
>
>
>> Could you suggest another way to benchmark random reads?
>>
>
> Your 1100 IOPS times an 8kB block size gives about 8MB/s of throughput,
> which is close to what you report. So I think I would instead focus on
> tuning your actual queries. You say the problem is not query-dependent,
> but I think that just means all the queries you looked at are similar. If
> you looked at a query that can't use indexes, like count(unindexed_column)
> from biggest_table; you would find it doing much more IO than 4MB/s.
>
> Can you pick the simplest query you actually care about, and post both an
> "explain (analyze, timing off)" and an "explain (analyze, buffers)" for it?
> (Preferably turning "track_io_timing" on first).
>
> One other question I had: you said you had "2x Intel Xeon E5550", which
> should be 8 CPUs (or 16, if the hyperthreads are reported as separate
> CPUs). But you also said: "Also using dstat I can see that iowait time is
> at about 25%". Usually, if there is only one thing going on on the server,
> then iowait won't be more than the reciprocal of the CPU count. Is the
> server busy doing other stuff at the same time you are benchmarking it?
>
> Cheers,
>
> Jeff
>

--
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/
