Re: Write lifetime hints for NVMe

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Write lifetime hints for NVMe
Date: 2018-01-27 15:03:55
Message-ID: 30965a3e-5bde-4f70-dc06-1ff297abca4c@2ndquadrant.com
Lists: pgsql-hackers

On 01/27/2018 02:20 PM, Dmitry Dolgov wrote:
> Hi,
>
> From what I see, support for write lifetime hints for NVMe multi-streaming
> was merged into the Linux kernel some time ago [1]. In theory it allows data
> with a similar lifetime to be written together on the media so that it can be
> erased together, which minimizes garbage collection and results in reduced
> write amplification as well as more efficient flash utilization [2]. I
> couldn't find any discussion about that on hackers, so I decided to experiment
> with this feature a bit. My idea was to test a quite naive approach in which
> all file descriptors related to temporary files are assigned
> `RWH_WRITE_LIFE_SHORT`, and the rest of them `RWH_WRITE_LIFE_EXTREME`. The
> attached patch is a dead simple POC without any infrastructure around it to
> enable/disable hints.
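
For readers who have not seen this interface: below is a minimal sketch of the naive policy described above, not the attached patch. It assumes Linux 4.13 or later, where `fcntl()` accepts `F_SET_RW_HINT` / `F_GET_RW_HINT` with a pointer to a 64-bit value. The numeric constants are copied from `<linux/fcntl.h>` because Python's `fcntl` module may not export them, and the helper names (`hint_for`, `set_write_hint`, `get_write_hint`) are hypothetical.

```python
import fcntl
import struct
import tempfile

# Values from the Linux UAPI header <linux/fcntl.h> (kernel 4.13+).
F_LINUX_SPECIFIC_BASE = 1024
F_GET_RW_HINT = F_LINUX_SPECIFIC_BASE + 11
F_SET_RW_HINT = F_LINUX_SPECIFIC_BASE + 12

RWH_WRITE_LIFE_SHORT = 2
RWH_WRITE_LIFE_EXTREME = 5


def hint_for(is_temp_file):
    """Naive policy: temporary files get SHORT, everything else EXTREME."""
    return RWH_WRITE_LIFE_SHORT if is_temp_file else RWH_WRITE_LIFE_EXTREME


def set_write_hint(fd, hint):
    """Attach a write lifetime hint to the inode behind fd.

    Returns False when the kernel or filesystem rejects the request,
    so callers can treat the hint as purely advisory.
    """
    try:
        # The kernel expects a pointer to a uint64_t, hence the packed buffer.
        fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("=Q", hint))
        return True
    except OSError:
        return False


def get_write_hint(fd):
    """Read back the hint currently stored on the inode behind fd."""
    raw = fcntl.fcntl(fd, F_GET_RW_HINT, struct.pack("=Q", 0))
    return struct.unpack("=Q", raw)[0]


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        if set_write_hint(f.fileno(), hint_for(is_temp_file=True)):
            print("hint now:", get_write_hint(f.fileno()))
        else:
            print("write lifetime hints not supported here")
```

Whether the hint has any effect depends on the filesystem and on the device actually supporting streams, so a successful call does not by itself show that the drive treats the data differently.
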
>
> It turns out that it's possible to perform benchmarks on some EC2 instance
> types (e.g. c5) with a suitable kernel version, since they expose a volume
> as an NVMe device:
>
> ```
> # nvme list
> Node          SN                    Model                       Namespace  Usage             Format       FW Rev
> ------------  --------------------  --------------------------  ---------  ----------------  -----------  ------
> /dev/nvme0n1  vol01cdbc7ec86f17346  Amazon Elastic Block Store  1          0.00 B / 8.59 GB  512 B + 0 B  1.0
> ```
>
> To get some baseline results I've run several rounds of pgbench on these quite
> modest instances (dedicated, with optimized EBS) with slightly adjusted
> `max_wal_size` and with default configuration:
>
> $ pgbench -s 200 -i
> $ pgbench -T 600 -c 2 -j 2
>
> Analyzing the `strace` output, I can see that during this test there was a
> significant number of operations on pg_stat_tmp and xlogtemp, so I assume
> write lifetime hints should have some effect.
>
> As a result I got a latency reduction of about 5-8% (but so far these numbers
> are unstable, probably because of virtualization).
>
> ```
> # without patch
> number of transactions actually processed: 491945
> latency average = 2.439 ms
> tps = 819.906323 (including connections establishing)
> tps = 819.908755 (excluding connections establishing)
> ```
>
> ```
> # with patch
> number of transactions actually processed: 521805
> latency average = 2.300 ms
> tps = 869.665330 (including connections establishing)
> tps = 869.668026 (excluding connections establishing)
> ```
>

Aren't those numbers far lower than you'd expect from NVMe storage? I do
have an NVMe drive (Intel 750) in my machine, and I can do thousands of
transactions per second on it with two clients. Seems a bit suspicious.
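
The quoted figures are at least internally consistent, which is easy to check. A rough sketch, assuming pgbench's average latency is approximately clients / tps (the helper names are made up for illustration; the inputs are the numbers from the quoted runs):

```python
def tps_from_latency(clients, latency_ms):
    """Throughput implied by an average latency, with each client running
    one transaction at a time."""
    return clients * 1000.0 / latency_ms


def relative_change(before, after):
    """Signed relative change from before to after (0.05 means +5%)."""
    return (after - before) / before


if __name__ == "__main__":
    # Two clients, latency averages taken from the quoted pgbench output.
    print("implied tps without patch: %.1f" % tps_from_latency(2, 2.439))
    print("implied tps with patch:    %.1f" % tps_from_latency(2, 2.300))

    # Improvement as seen in latency and in the reported tps.
    print("latency change: %.1f%%" % (100 * relative_change(2.439, 2.300)))
    print("tps change:     %.1f%%" % (100 * relative_change(819.906323, 869.665330)))
```

So the reported tps matches the reported latency for two clients, and the difference between the two runs is about 6%. The absolute level of roughly 800 tps is what looks low.
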

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
