Re: postgresql latency & bgwriter not doing its job

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-30 17:19:59
Message-ID: 20140830171959.GA2327@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
> them out in order (http://www.postgresql.org/message-id/flat/20070614153758(dot)6A62(dot)ITAGAKI(dot)TAKAHIRO(at)oss(dot)ntt(dot)co(dot)jp)
> The performance impact of that was inconclusive, but one thing that it
> allows nicely is to interleave the fsyncs, so that you write all the buffers
> for one file, then fsync it, then next file and so on. IIRC the biggest
> worry with that patch was that sorting the buffers requires a fairly large
> amount of memory, and making a large allocation in the checkpointer might
> cause an out-of-memory, which would be bad.
>
> I don't think anyone's seriously worked on this area since. If the impact on
> responsiveness or performance is significant, I'm pretty sure the OOM
> problem could be alleviated somehow.

I've dug up that patch (after a bit of fighting with the archives) and
refreshed it. It's *clearly* beneficial:

master:
andres(at)awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 2847.6 tps, lat 53.862 ms stddev 49.219
...
progress: 67.0 s, 3435.4 tps, lat 52.920 ms stddev 48.720
progress: 68.2 s, 2586.9 tps, lat 57.793 ms stddev 64.228
progress: 69.1 s, 546.5 tps, lat 294.940 ms stddev 189.546
progress: 70.0 s, 1741.3 tps, lat 134.298 ms stddev 204.740
progress: 71.0 s, 3868.8 tps, lat 48.423 ms stddev 47.934
..
progress: 89.0 s, 4022.8 tps, lat 45.601 ms stddev 40.685
progress: 90.0 s, 2463.5 tps, lat 61.907 ms stddev 64.342
progress: 91.2 s, 856.3 tps, lat 211.610 ms stddev 149.916
progress: 92.0 s, 1026.9 tps, lat 177.103 ms stddev 144.448
progress: 93.0 s, 736.9 tps, lat 254.230 ms stddev 227.339
progress: 94.1 s, 766.9 tps, lat 208.031 ms stddev 181.340
progress: 95.1 s, 979.7 tps, lat 197.014 ms stddev 193.648
progress: 96.0 s, 868.9 tps, lat 214.060 ms stddev 198.297
progress: 97.1 s, 943.4 tps, lat 178.062 ms stddev 143.224
progress: 98.0 s, 934.5 tps, lat 192.435 ms stddev 197.901
progress: 99.6 s, 623.1 tps, lat 202.954 ms stddev 165.576
progress: 100.0 s, 464.7 tps, lat 683.600 ms stddev 376.520
progress: 101.1 s, 516.0 tps, lat 395.033 ms stddev 480.346
progress: 102.0 s, 364.9 tps, lat 507.933 ms stddev 499.670
progress: 103.3 s, 592.9 tps, lat 214.123 ms stddev 273.411
progress: 104.1 s, 930.2 tps, lat 316.487 ms stddev 335.096
progress: 105.4 s, 627.6 tps, lat 262.496 ms stddev 200.690
progress: 106.1 s, 788.6 tps, lat 235.510 ms stddev 202.366
progress: 107.5 s, 644.8 tps, lat 269.020 ms stddev 223.900
progress: 108.0 s, 725.0 tps, lat 262.692 ms stddev 218.774
progress: 109.0 s, 660.0 tps, lat 272.808 ms stddev 248.501
progress: 110.0 s, 604.3 tps, lat 303.727 ms stddev 264.921
progress: 111.0 s, 723.6 tps, lat 243.224 ms stddev 229.426
progress: 112.1 s, 668.6 tps, lat 257.026 ms stddev 190.247
progress: 113.1 s, 345.0 tps, lat 492.114 ms stddev 312.082
progress: 115.4 s, 390.9 tps, lat 416.708 ms stddev 391.577
progress: 115.4 s, 14598.5 tps, lat 551.617 ms stddev 539.611
progress: 116.1 s, 161.5 tps, lat 741.611 ms stddev 485.498
progress: 117.4 s, 269.1 tps, lat 697.978 ms stddev 576.970
progress: 118.8 s, 262.3 tps, lat 674.887 ms stddev 587.848
progress: 119.1 s, 195.2 tps, lat 833.959 ms stddev 733.592
progress: 120.0 s, 3000.6 tps, lat 104.272 ms stddev 291.851
progress: 121.0 s, 3167.7 tps, lat 56.576 ms stddev 51.976
progress: 122.0 s, 3398.1 tps, lat 51.322 ms stddev 48.057
progress: 123.0 s, 3721.9 tps, lat 50.355 ms stddev 46.994
progress: 124.0 s, 2929.3 tps, lat 50.996 ms stddev 45.553
progress: 125.0 s, 754.5 tps, lat 269.293 ms stddev 287.508
progress: 126.0 s, 3297.0 tps, lat 56.912 ms stddev 77.053
...
progress: 144.0 s, 4207.9 tps, lat 44.440 ms stddev 37.210
progress: 145.9 s, 2036.4 tps, lat 79.025 ms stddev 105.411
progress: 146.0 s, 1003.1 tps, lat 292.934 ms stddev 223.650
progress: 147.4 s, 520.8 tps, lat 318.670 ms stddev 244.596
progress: 148.0 s, 3557.3 tps, lat 71.626 ms stddev 143.174
progress: 149.0 s, 4106.1 tps, lat 43.557 ms stddev 36.444
progress: 150.0 s, 4132.3 tps, lat 43.185 ms stddev 34.611
progress: 151.0 s, 4233.3 tps, lat 43.239 ms stddev 39.121
progress: 152.0 s, 4178.2 tps, lat 43.242 ms stddev 40.377
progress: 153.0 s, 755.1 tps, lat 198.560 ms stddev 155.927
progress: 154.1 s, 773.6 tps, lat 240.044 ms stddev 192.472
progress: 155.0 s, 553.7 tps, lat 303.532 ms stddev 245.491
progress: 156.2 s, 772.7 tps, lat 242.925 ms stddev 226.754
progress: 157.1 s, 541.0 tps, lat 295.132 ms stddev 218.857
progress: 158.1 s, 716.8 tps, lat 281.823 ms stddev 227.488
progress: 159.1 s, 748.7 tps, lat 223.275 ms stddev 186.162
progress: 160.0 s, 503.0 tps, lat 311.621 ms stddev 215.952
progress: 161.1 s, 905.0 tps, lat 239.623 ms stddev 245.539
progress: 162.4 s, 360.4 tps, lat 329.583 ms stddev 250.094
progress: 163.3 s, 348.9 tps, lat 583.476 ms stddev 432.200
progress: 165.5 s, 186.1 tps, lat 765.542 ms stddev 552.133
progress: 165.5 s, 9950.2 tps, lat 1424.984 ms stddev 240.249
progress: 166.4 s, 137.3 tps, lat 1056.184 ms stddev 765.356
progress: 167.2 s, 249.7 tps, lat 793.423 ms stddev 792.666
progress: 168.6 s, 194.0 tps, lat 851.712 ms stddev 737.653
progress: 169.2 s, 136.0 tps, lat 1229.394 ms stddev 1227.723
progress: 170.6 s, 92.3 tps, lat 1569.107 ms stddev 1517.027
progress: 171.2 s, 276.5 tps, lat 1325.410 ms stddev 1252.237
progress: 172.0 s, 3881.2 tps, lat 91.122 ms stddev 366.052
progress: 174.4 s, 995.6 tps, lat 120.200 ms stddev 242.005
progress: 174.4 s, 16227.2 tps, lat 651.361 ms stddev 527.946
progress: 175.1 s, 200.0 tps, lat 808.114 ms stddev 542.989
progress: 176.1 s, 179.4 tps, lat 1024.893 ms stddev 724.887
progress: 177.4 s, 367.0 tps, lat 648.519 ms stddev 790.413
progress: 178.4 s, 2236.7 tps, lat 114.064 ms stddev 271.525
progress: 179.0 s, 3643.5 tps, lat 48.404 ms stddev 47.793
progress: 180.0 s, 3309.0 tps, lat 55.598 ms stddev 49.211
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 496443
latency average: 65.245 ms
latency stddev: 134.292 ms
tps = 2756.518300 (including connections establishing)
tps = 2756.718588 (excluding connections establishing)

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 139127 buffers (21.2%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=53.660 s, sync=6.739 s, total=60.611 s; sync files=18, longest=4.382 s, average=0.374 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 70472 buffers (10.8%); 0 transaction log file(s) added, 0 removed, 70 recycled; write=53.107 s, sync=5.375 s, total=58.605 s; sync files=15, longest=3.243 s, average=0.358 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 80357 buffers (12.3%); 0 transaction log file(s) added, 0 removed, 49 recycled; write=53.526 s, sync=3.896 s, total=57.573 s; sync files=15, longest=1.206 s, average=0.259 s

sorted writes:
starting vacuum...end.
progress: 1.0 s, 3237.2 tps, lat 49.544 ms stddev 42.203
..
progress: 88.0 s, 3030.8 tps, lat 47.438 ms stddev 46.563
progress: 89.3 s, 919.0 tps, lat 204.152 ms stddev 172.776
progress: 90.1 s, 973.7 tps, lat 204.650 ms stddev 197.843
progress: 91.1 s, 875.5 tps, lat 200.472 ms stddev 157.295
progress: 92.0 s, 939.5 tps, lat 185.357 ms stddev 146.362
progress: 93.1 s, 975.8 tps, lat 184.445 ms stddev 167.520
progress: 94.9 s, 673.0 tps, lat 260.900 ms stddev 224.125
progress: 95.0 s, 1057.6 tps, lat 223.717 ms stddev 260.276
progress: 96.0 s, 917.7 tps, lat 201.704 ms stddev 220.395
progress: 97.0 s, 2573.0 tps, lat 78.913 ms stddev 119.458
progress: 98.0 s, 3921.5 tps, lat 44.160 ms stddev 39.147
progress: 99.0 s, 4033.3 tps, lat 43.972 ms stddev 45.921
..
progress: 110.0 s, 4057.8 tps, lat 44.283 ms stddev 47.003
progress: 111.0 s, 897.6 tps, lat 158.250 ms stddev 132.126
progress: 112.0 s, 751.6 tps, lat 249.282 ms stddev 232.964
progress: 113.0 s, 550.6 tps, lat 296.155 ms stddev 226.163
progress: 114.0 s, 1875.3 tps, lat 123.059 ms stddev 210.661
progress: 115.1 s, 4555.4 tps, lat 39.404 ms stddev 32.331
..
progress: 151.0 s, 4240.2 tps, lat 42.158 ms stddev 38.577
progress: 152.1 s, 2551.3 tps, lat 63.584 ms stddev 73.780
progress: 153.1 s, 912.5 tps, lat 189.030 ms stddev 146.658
progress: 154.2 s, 976.2 tps, lat 192.499 ms stddev 158.108
progress: 155.0 s, 1005.1 tps, lat 178.215 ms stddev 129.062
progress: 156.0 s, 1010.1 tps, lat 175.583 ms stddev 138.823
progress: 157.0 s, 1042.0 tps, lat 161.675 ms stddev 135.350
progress: 158.1 s, 766.8 tps, lat 216.367 ms stddev 234.374
progress: 159.5 s, 1025.7 tps, lat 199.366 ms stddev 228.530
progress: 160.0 s, 1178.6 tps, lat 159.156 ms stddev 130.660
progress: 161.0 s, 2904.1 tps, lat 65.751 ms stddev 79.928
progress: 162.0 s, 4189.6 tps, lat 43.492 ms stddev 38.490
..
progress: 170.0 s, 4150.9 tps, lat 41.998 ms stddev 40.931
progress: 171.1 s, 595.5 tps, lat 233.138 ms stddev 188.073
progress: 172.0 s, 622.6 tps, lat 245.023 ms stddev 279.715
progress: 173.0 s, 588.1 tps, lat 359.104 ms stddev 315.367
progress: 174.0 s, 3011.0 tps, lat 74.811 ms stddev 172.972
progress: 175.0 s, 4829.0 tps, lat 37.520 ms stddev 30.578
progress: 176.0 s, 4324.1 tps, lat 41.388 ms stddev 37.727
progress: 177.0 s, 3632.8 tps, lat 49.219 ms stddev 47.092
progress: 178.0 s, 3621.0 tps, lat 50.332 ms stddev 43.086
progress: 179.0 s, 3468.3 tps, lat 50.969 ms stddev 43.025
progress: 180.0 s, 3710.7 tps, lat 48.782 ms stddev 49.070
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 634014
latency average: 51.089 ms
latency stddev: 63.864 ms
number of transactions above the 100.0 ms latency limit: 66921
tps = 3520.574003 (excluding connections establishing)

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 135453 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=53.419 s, sync=3.665 s, total=57.390 s; sync files=17, longest=1.937 s, average=0.215 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 96907 buffers (14.8%); 0 transaction log file(s) added, 0 removed, 68 recycled; write=53.147 s, sync=3.369 s, total=56.633 s; sync files=15, longest=1.920 s, average=0.224 s

sorted writes + flush in SyncOneBuffer():

andres(at)awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 3044.7 tps, lat 49.030 ms stddev 45.608
progress: 2.0 s, 3701.3 tps, lat 48.231 ms stddev 44.762
...
progress: 55.0 s, 4157.1 tps, lat 43.216 ms stddev 40.821
progress: 56.0 s, 4736.3 tps, lat 38.326 ms stddev 35.699
progress: 57.0 s, 2604.9 tps, lat 66.072 ms stddev 65.534
progress: 58.0 s, 1659.8 tps, lat 113.848 ms stddev 121.625
progress: 59.0 s, 2625.2 tps, lat 66.112 ms stddev 54.349
..
progress: 106.0 s, 2571.8 tps, lat 72.806 ms stddev 83.308
progress: 107.0 s, 1818.7 tps, lat 93.315 ms stddev 103.284
progress: 108.1 s, 1876.6 tps, lat 97.447 ms stddev 103.151
progress: 109.1 s, 1768.8 tps, lat 110.406 ms stddev 121.347
progress: 110.0 s, 3129.4 tps, lat 56.100 ms stddev 54.835
progress: 111.0 s, 4402.0 tps, lat 40.581 ms stddev 42.269
progress: 112.0 s, 4379.3 tps, lat 42.625 ms stddev 39.507
progress: 113.0 s, 4551.6 tps, lat 39.303 ms stddev 30.563
progress: 114.0 s, 4383.3 tps, lat 41.169 ms stddev 37.459
progress: 115.1 s, 4652.3 tps, lat 38.769 ms stddev 33.415
progress: 116.1 s, 4015.5 tps, lat 41.939 ms stddev 34.122
progress: 117.1 s, 2159.0 tps, lat 84.653 ms stddev 81.445
progress: 118.0 s, 2971.3 tps, lat 62.874 ms stddev 62.229
progress: 119.0 s, 2825.1 tps, lat 60.875 ms stddev 54.767
progress: 120.0 s, 2542.9 tps, lat 75.276 ms stddev 68.709
progress: 121.1 s, 2650.4 tps, lat 63.833 ms stddev 56.419
progress: 122.0 s, 2766.8 tps, lat 68.196 ms stddev 67.042
..
progress: 166.0 s, 2412.4 tps, lat 71.307 ms stddev 74.961
progress: 167.1 s, 1742.5 tps, lat 99.519 ms stddev 102.500
progress: 168.1 s, 1324.2 tps, lat 122.625 ms stddev 135.177
progress: 169.0 s, 2002.5 tps, lat 108.435 ms stddev 125.576
progress: 170.0 s, 2796.6 tps, lat 64.867 ms stddev 64.665
progress: 171.0 s, 3969.1 tps, lat 45.520 ms stddev 41.656
...
progress: 180.0 s, 3003.9 tps, lat 59.134 ms stddev 51.508
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 584725
latency average: 55.375 ms
latency stddev: 56.202 ms
number of transactions above the 100.0 ms latency limit: 76644
tps = 3246.299654 (including connections establishing)
tps = 3246.607040 (excluding connections establishing)

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 135474 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=53.260 s, sync=0.079 s, total=53.577 s; sync files=19, longest=0.032 s, average=0.004 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 92770 buffers (14.2%); 0 transaction log file(s) added, 0 removed, 68 recycled; write=53.677 s, sync=0.239 s, total=54.026 s; sync files=16, longest=0.093 s, average=0.014 s

So, *very* clearly sorting is a benefit. Both the dips in performance
are lower and the aggregate throughput is higher.

Additionally flushing buffers during the flush is a clear win from the
perspective of reliable performance, but noticeable
performancewise. Although much less than without sorting.

Now, I've used a absurdly small checkpoint_timeout here to have the time
to run these benchmarks. But I'm damn sure this similar with real
settings.

So, what I think we need to do is:

1) Sort writes on checkpoints. It's a clear win. Personally I'm
perfectly happy waving away the concerns about OOM. It's not that much
memory:
# SELECT pg_size_pretty(24 * ((64::int8 * 1024 * 1024 * 1024) /
8192::int8));
┌────────────────┐
│ pg_size_pretty │
├────────────────┤
│ 192 MB │
└────────────────┘

If you can aford 64GB shared buffers, 192MB memory allocated in the
checkpointer hardly is a problem. (24 is sizeof(BufferTag) + sizeof(bufid))

2) At least optionally flush dirty buffers during the buffer sync
scan. The improvements in steady performance are too big to ignore. I
think we might be able to make it an unconditional win by flushing in
batches instead of doing so for each buffer like I've done.

3) Possibly do the mdsync() after & for each filenode (or segment?) like
Heikki basically just prototyped. That'll reduce the number of dirty
buffers our fsync() will flush out unnecessarily (which will create
stalls because they'll sit in the kernel's buffer cache).

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time. I would be
> interested to hear how this performs in your test case.

I think that'll perform horribly on large shared buffers with many
relations. Not a uncommon scenario...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-08-30 17:24:30 Re: Built-in binning functions
Previous Message Ali Akbar 2014-08-30 16:43:38 Re: [PATCH] empty xml values