From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Jerry Jelinek <jerry(dot)jelinek(at)joyent(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch to allow disable of WAL recycling
Date: 2018-08-31 22:49:26
Message-ID: 42897d2d-cc6d-5f9f-6db0-6acb54582be1@2ndquadrant.com
Lists: pgsql-hackers
On 08/27/2018 03:59 AM, Thomas Munro wrote:
> On Mon, Aug 27, 2018 at 10:14 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> zfs (Linux)
>> -----------
>> On scale 200, there's pretty much no difference.
>
> Speculation: It could be that the dnode and/or indirect blocks that
> point to data blocks are falling out of memory in my test setup[1] but
> not in yours. I don't know, but I guess those blocks compete with
> regular data blocks in the ARC? If so it might come down to ARC size
> and the amount of other data churning through it.
>
Not sure, but I'd expect this to matter at the largest scale. The
machine has 64GB of RAM, and scale 8000 is ~120GB with mostly random
access. I've repeated the tests with scale 6000 to give ZFS a bit more
free space and avoid the issues that appear with less than 20% free
space (results later), but I still don't see any massive improvement.
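
One way to test that speculation would be to watch the ARC counters
during a run and see whether metadata (dnodes, indirect blocks) gets
squeezed out as data churns through. A minimal sketch, assuming ZFS on
Linux, which exposes the kstats under /proc/spl/kstat/zfs/arcstats (the
exact field names vary between ZFS versions, so they are best-effort
here):

#!/usr/bin/env python3
# Rough sketch for watching ARC metadata during a benchmark run; assumes
# ZFS on Linux, which exposes kstats under /proc/spl/kstat/zfs/arcstats.
# Field names differ between ZFS versions, so missing ones print as 'n/a'.
import time

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"
FIELDS = ("size", "c_max", "metadata_size", "dnode_size", "arc_meta_used")

def read_arcstats():
    stats = {}
    with open(ARCSTATS) as f:
        for line in f:
            parts = line.split()
            # data lines look like: "<name> <type> <value>"
            if len(parts) == 3 and parts[2].isdigit():
                stats[parts[0]] = int(parts[2])
    return stats

while True:   # stop with Ctrl-C
    s = read_arcstats()
    print("  ".join("%s=%s" % (k, s.get(k, "n/a")) for k in FIELDS))
    time.sleep(10)
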
> Further speculation: Other filesystems have equivalent data structures,
> but for example XFS jams that data into the inode itself in a compact
> "extent list" format[2] if it can, to avoid the need for an external
> btree. Hmm, I wonder if that format tends to be used for our segment
> files. Since cached inodes are reclaimed in a different way than cached
> data pages, I wonder if that makes them more sticky in the face of high
> data churn rates (or I guess less, depending on your Linux
> vfs_cache_pressure setting and number of active files). I suppose the
> combination of those two things, sticky inodes with internalised extent
> lists, might make it more likely that we can overwrite an old file
> without having to fault anything in.
>
That's possible. The question is how that affects which cases make it
worth disabling WAL reuse, and why you observe better performance while
I don't.
> One big difference between your test rig and mine is that your Optane
> 900P claims to do about half a million random IOPS. That is about half
> a million more IOPS than my spinning disks. (Actually I used my 5400RPM
> steam powered machine deliberately for that test: I disabled fsync so
> that commit rate wouldn't be slowed down but cache misses would be
> obvious. I guess Joyent's storage is somewhere between these two
> extremes...)
>
Yeah. It seems very much like a CPU vs. I/O trade-off, where disabling
WAL reuse saves a bit of I/O but increases the CPU cost. On the SSD the
reduced number of I/O requests is not noticeable, but the extra CPU cost
does matter (thanks to the high tps values). On slower devices the I/O
savings will probably matter more.
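
To make the trade-off concrete, here's a minimal sketch of the two
strategies being compared (an illustration only, not PostgreSQL's actual
code; 16MB is just the default segment size): recycling renames an
already-allocated old segment into place and overwrites it later, while
disabling recycling means allocating and zero-filling a brand new file
every time.

import os

SEGMENT_SIZE = 16 * 1024 * 1024   # default WAL segment size

def recycle_segment(old_path, new_path):
    # Reuse an old segment: a cheap rename, the blocks already exist.
    # The cost shows up later, when the filesystem has to look up and
    # overwrite those existing blocks (metadata that may have fallen
    # out of cache) - and on a COW filesystem like ZFS the overwrite
    # is not really "in place" anyway.
    os.rename(old_path, new_path)

def create_segment(new_path):
    # Create a fresh segment: zero-fill 16MB of new space. More write
    # I/O and CPU up front, but no old block tree to fault in.
    with open(new_path, "wb") as f:
        f.write(b"\0" * SEGMENT_SIZE)
        f.flush()
        os.fsync(f.fileno())
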
>> On scale 2000, the
>> throughput actually decreased a bit, by about 5% - from the chart it
>> seems disabling the WAL reuse somewhat amplifies impact of checkpoints,
>> for some reason.
>
> Huh.
>
Not sure what's causing this. It's not visible in the SATA results, though.
>> I have no idea what happened at the largest scale (8000) - on master
>> there's a huge drop after ~120 minutes, which somewhat recovers at ~220
>> minutes (but not fully). Without WAL reuse there's no such drop,
>> although there seems to be some degradation after ~220 minutes (i.e. at
>> about the same time the master partially recovers). I'm not sure what to
>> think about this, I wonder if it might be caused by almost filling the
>> disk space, or something like that. I'm rerunning this with scale 6000.
>
> There are lots of reports of ZFS performance degrading when free space
> gets below something like 20%.
>
I've repeated the benchmarks on the Optane SSD with the largest scale
reduced to 6000, to see if it prevents the performance drop with less
than 20% of free space. It apparently does (see zfs2.pdf), although it
does not change the behavior - with WAL reuse disabled it's still a bit
slower.
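(For scale, the pgbench data set works out to roughly 120GB / 8000, i.e.
about 15MB per unit of scale here, so scale 6000 is about 90GB and
leaves correspondingly more of the pool free.)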
I've also done the tests with SATA devices (3x 7.2k drives), to see if
it changes the behavior due to I/O vs. CPU trade-off. And it seems to be
the case (see zfs-sata.pdf), to some extent. For the smallest scale
(200) there's not much difference. For medium (2000) there seems to be a
clear improvement, although the behavior is not particularly smooth. On
the largest scale (8000) there seems to be a slight improvement, or at
least it's not slower like before.
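
For anyone wanting to reproduce runs along these lines, a driver like
the following would do; the actual pgbench options used for these
results aren't given in this message, so the client count, run length
and database name below are placeholders.

import subprocess

DBNAME = "pgbench"            # placeholder database name
SCALES = (200, 2000, 8000)    # scales discussed in this thread
CLIENTS = 16                  # placeholder client count
DURATION = 4 * 3600           # placeholder run length (seconds)

for scale in SCALES:
    subprocess.run(["pgbench", "-i", "-s", str(scale), DBNAME], check=True)
    subprocess.run(["pgbench", "-c", str(CLIENTS), "-j", str(CLIENTS),
                    "-T", str(DURATION), "-P", "60", DBNAME], check=True)
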
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
  zfs-sata.pdf (application/pdf, 27.5 KB)
  zfs2.pdf (application/pdf, 27.3 KB)