From: | Dragan Milivojević <galileo(at)pkm-inc(dot)com> |
---|---|
To: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Is NVMe RAID useless (performance-wise) with PostgreSQL? |
Date: | 2024-06-13 00:54:29 |
Message-ID: | 514b55e0-342c-4ec9-b6b4-544af0415dd6@pkm-inc.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hi all,
While building a new PostgreSQL server, I realized that the performance with a single disk
is the same or better than with a RAID0 4-disk array.
All benchmarks were conducted using pgbench with a scaling factor of 2000.
For a typical run with pgbench -j 4 -c 512 -P 60 -r -T 300 -b tpcb-like, these are the results:
single disk run:
latency average = 17.524 ms
latency stddev = 6.904 ms
tps = 28870
iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
26.10 0.00 21.02 2.95 0.00 49.93
| Device | r/s | rMB/s | rrqm/s | %rrqm | r_await | rareq-sz | w/s | wMB/s | wrqm/s | %wrqm | w_await | wareq-sz | f/s | f_await | aqu-sz | %util |
|---------|:--------:|:------:|:------:|:-----:|:-------:|:--------:|:--------:|:------:|:------:|:-----:|:-------:|:--------:|:------:|:-------:|:------:|:-----:|
| nvme0n1 | 28641.27 | 255.00 | 0.00 | 0.00 | 0.16 | 9.12 | 27665.67 | 458.09 | 0.00 | 0.00 | 0.09 | 16.96 | 251.47 | 1.69 | 7.69 | 98.08 |
RAID0 4 disk, 4K chunk:
latency average = 22.269 ms
latency stddev = 10.825 ms
tps = 22742
avg-cpu: %user %nice %system %iowait %steal %idle
23.63 0.00 19.63 1.53 0.00 55.21
| Device | r/s | rMB/s | rrqm/s | %rrqm | r_await | rareq-sz | w/s | wMB/s | wrqm/s | %wrqm | w_await | wareq-sz | f/s | f_await | aqu-sz | %util |
|---------|:--------:|:------:|:-------:|:-----:|:-------:|:--------:|:---------:|:------:|:------:|:-----:|:-------:|:--------:|:------:|:-------:|:------:|:-----:|
| md127 | 55359.93 | 216.25 | 0.00 | 0.00 | 0.09 | 4.00 | 105629.07 | 412.61 | 0.00 | 0.00 | 0.04 | 4.00 | 0.00 | 0.00 | 9.02 | 93.76 |
| nvme1n1 | 12763.33 | 54.03 | 1067.47 | 7.72 | 0.08 | 4.33 | 26572.07 | 103.31 | 37.33 | 0.14 | 0.05 | 3.98 | 162.53 | 1.74 | 2.67 | 99.18 |
| nvme3n1 | 12753.07 | 53.97 | 1063.87 | 7.70 | 0.08 | 4.33 | 26560.47 | 103.26 | 37.40 | 0.14 | 0.05 | 3.98 | 162.47 | 1.73 | 2.58 | 99.15 |
| nvme4n1 | 12787.27 | 54.10 | 1062.80 | 7.67 | 0.09 | 4.33 | 26492.73 | 102.99 | 35.67 | 0.13 | 0.05 | 3.98 | 162.53 | 1.69 | 2.67 | 99.07 |
| nvme5n1 | 12796.53 | 54.15 | 1065.60 | 7.69 | 0.09 | 4.33 | 26505.67 | 103.04 | 35.73 | 0.13 | 0.05 | 3.98 | 162.53 | 1.66 | 2.56 | 98.95 |
BTW, if these tables are mangled in transport or by email clients, I posted this email to https://pastebin.com/raw/ZmsH0T5M.
A 4K chunk is obviously not optimal, but I should still be getting around a 2x uplift.
In the past, when tuning PostgreSQL, I tweaked various RAID parameters like chunk size, stripe_cache_size, etc.
but it never occurred to me to check the performance against a single drive. Hence this email.
I'm not sure if this is expected or if there is something wrong with my setup.
Full system details are at the end of the message.
While exploring this, I went deep down the rabbit hole, running hundreds of tests and trying dozens of configurations.
The best I achieved was 31K TPS with a plain RAID0 256KB chunk, nvme poll_queues=4 and io_pool=1.
This resulted in a measly 2% improvement compared to a single disk.
A sample of results:
Single disk nvme poll_queues=0
pgbench -j4 -c X tps avg latency ms latency stddev ms
1 477 2.096 0.258
4 1167 3.426 0.295
16 4408 3.623 0.545
64 12533 5.089 0.999
128 21295 5.979 1.538
256 28022 9.048 3.014
512 28870 17.524 6.904
Single disk nvme poll_queues=4
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 22284 5.711 1.448
256 27390 9.240 2.848
512 30596 16.452 6.090
1024 26352 38.481 19.513
4 disk RAID 0 4KB Chunk nvme poll_queues=0
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 17614 7.231 2.43
256 22347 11.37 4.922
512 22742 22.269 10.825
1024 20896 48.57 26.975
4 disk LVM RAID 0 4KB Chunk nvme poll_queues=4
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 17423 7.312 1.991
256 22064 11.521 4.044
512 24875 20.373 9.421
1024 21242 47.692 25.843
4 disk RAID 0 8KB Chunk nvme poll_queues=4
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 18907 6.736 2.094
256 24909 10.184 3.446
512 24878 20.331 8.448
1024 20535 49.665 27.462
4 disk RAID 0 64KB Chunk nvme poll_queues=4
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 21393 5.951 2.102
256 27231 9.293 2.752
512 30261 16.624 5.986
1024 25245 40.285 21.317
4 disk RAID 0 256KB Chunk nvme poll_queues=4
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 21055 6.046 1.972
256 27226 9.289 2.736
512 31174 16.120 5.858
1024 27069 37.355 17.797
4 disk RAID 5 4KB Chunk nvme poll_queues=4 group_thread_cnt=0 stripe_cache_size=256
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 15368 8.291 3.527
256 19110 13.312 6.151
512 19796 25.667 13.191
4 disk RAID 5 4KB Chunk nvme poll_queues=4 group_thread_cnt=4 stripe_cache_size=256
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 17143 7.431 2.353
256 21777 11.665 4.175
512 22602 22.433 9.239
1024 20291 50.189 26.191
4 disk RAID 5 4KB Chunk nvme poll_queues=4 group_thread_cnt=8 stripe_cache_size=256
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 16767 7.598 2.698
256 21646 11.733 4.288
512 22161 22.879 9.811
4 disk RAID 5 256KB Chunk nvme poll_queues=4 group_thread_cnt=8 stripe_cache_size=4096
pgbench -j4 -c X tps avg latency ms latency stddev ms
128 15997 7.966 2.754
256
512 21745 23.293 9.805
The obvious conclusion is that there is something broken with Linux MD RAID, so I examined
whether RAID0 scales with the simplest workloads, such as sequential reads.
Unfortunately, I did not reach a definite conclusion.
I posted my tests to the Linux RAID list but received no response (https://lore.kernel.org/linux-raid/1af8f1e0-4f41-4f25-bc34-f655a4c141b4(at)pkm-inc(dot)com/T/#u)
So my question is this: has anyone else tested whether MD RAID scales with PostgreSQL, and what results did you get?
What is your typical setup when you have a bunch of NVMe drives in a shiny new server?
System specs:
Dell PowerEdge R7525, Dual AMD EPYC 7313, 32G DDR4 3200
Disks used for tests: "Samsung SSD 980 PRO with Heatsink 2TB" drives
All drives under test are connected to the same processor.
AlmaLinux release 9.4 (Seafoam Ocelot)
Kernel 5.14.0-427.20.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC
tuned-adm profile postgresql
tweaks: poll_queues=4, io_poll = 1
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 16.1 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), 64-bit
(1 row)
name | current_setting | source
------------------------------+--------------------+--------------------
application_name | psql | client
checkpoint_completion_target | 0.9 | configuration file
client_encoding | UTF8 | client
DateStyle | ISO, MDY | configuration file
default_statistics_target | 100 | configuration file
default_text_search_config | pg_catalog.english | configuration file
dynamic_shared_memory_type | posix | configuration file
effective_cache_size | 24GB | configuration file
effective_io_concurrency | 200 | configuration file
lc_messages | C.UTF-8 | configuration file
lc_monetary | C.UTF-8 | configuration file
lc_numeric | C.UTF-8 | configuration file
lc_time | C.UTF-8 | configuration file
log_filename | postgresql-%a.log | configuration file
log_rotation_age | 1d | configuration file
log_rotation_size | 0 | configuration file
log_timezone | Europe/Belgrade | configuration file
log_truncate_on_rotation | on | configuration file
logging_collector | on | configuration file
maintenance_work_mem | 2GB | configuration file
max_connections | 1024 | configuration file
max_wal_size | 8GB | configuration file
min_wal_size | 2GB | configuration file
random_page_cost | 1.1 | configuration file
shared_buffers | 8GB | configuration file
TimeZone | Europe/Belgrade | configuration file
wal_buffers | 16MB | configuration file
work_mem | 4MB | configuration file
Thanks
Dragan
From | Date | Subject | |
---|---|---|---|
Next Message | Chema | 2024-06-13 01:31:32 | "permission denied to COPY to or from an external program" even with GRANT pg_execute_server_program |
Previous Message | Adrian Klaver | 2024-06-12 23:38:54 | Re: Definging columns for INSERT statements |