Re: patch to allow disable of WAL recycling

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Jerry Jelinek <jerry(dot)jelinek(at)joyent(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch to allow disable of WAL recycling
Date: 2018-08-31 22:49:26
Message-ID: 42897d2d-cc6d-5f9f-6db0-6acb54582be1@2ndquadrant.com
Lists: pgsql-hackers


On 08/27/2018 03:59 AM, Thomas Munro wrote:
> On Mon, Aug 27, 2018 at 10:14 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
>> zfs (Linux)
>> -----------
>> On scale 200, there's pretty much no difference.
>
> Speculation: It could be that the dnode and/or indirect blocks that
> point to data blocks are falling out of memory in my test setup[1] but
> not in yours.  I don't know, but I guess those blocks compete with
> regular data blocks in the ARC?  If so it might come down to ARC size
> and the amount of other data churning through it.
>

Not sure, but I'd expect this to matter most on the largest scale. The
machine has 64GB of RAM, and scale 8000 is ~120GB with mostly random
access. I've repeated the tests with scale 6000 to give ZFS a bit more
free space and avoid the issues that appear when less than 20% of the
space is free (results below), but I still don't see any massive
improvement.
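
For context, here's the back-of-the-envelope sizing I'm working from, as
a small Python sketch. The ~15 MB per pgbench scale unit is just the
figure implied by scale 8000 being ~120GB, not a measured constant:

    # rough sizing: ~15 MB per pgbench scale unit (implied by 8000 ~ 120GB)
    MB_PER_SCALE = 15
    RAM_GB = 64

    for scale in (200, 2000, 6000, 8000):
        size_gb = scale * MB_PER_SCALE / 1024.0
        fits = "fits in RAM" if size_gb < RAM_GB else "exceeds RAM"
        print(f"scale {scale:>5}: ~{size_gb:6.1f} GB ({fits})")

So only the two smaller scales fit in RAM, while 6000 and 8000 are
mostly random I/O against the storage.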

> Further speculation:  Other filesystems have equivalent data structures,
> but for example XFS jams that data into the inode itself in a compact
> "extent list" format[2] if it can, to avoid the need for an external
> btree.  Hmm, I wonder if that format tends to be used for our segment
> files.  Since cached inodes are reclaimed in a different way than cached
> data pages, I wonder if that makes them more sticky in the face of high
> data churn rates (or I guess less, depending on your Linux
> vfs_cache_pressure setting and number of active files).  I suppose the
> combination of those two things, sticky inodes with internalised extent
> lists, might make it more likely that we can overwrite an old file
> without having to fault anything in.
>

That's possible. The question is how that affects the cases where
disabling WAL reuse is worth it, and why you observe better performance
while I don't.

> One big difference between your test rig and mine is that your Optane
> 900P claims to do about half a million random IOPS.  That is about half
> a million more IOPS than my spinning disks.  (Actually I used my 5400RPM
> steam powered machine deliberately for that test: I disabled fsync so
> that commit rate wouldn't be slowed down but cache misses would be
> obvious.  I guess Joyent's storage is somewhere between these two
> extremes...)
>

Yeah. It seems very much like a CPU vs. I/O trade-off, where disabling
WAL reuse saves a bit of I/O but increases the CPU cost. On the SSD the
reduced number of I/O requests is not noticeable, but the extra CPU cost
does matter (thanks to the high tps values). On slower devices the I/O
savings will probably matter more.
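
To make the trade-off a bit more concrete, here's a simplified sketch of
the two strategies (Python just for illustration - the real logic lives
in xlog.c, and this ignores syncing the directory etc.):

    import os

    SEG_SIZE = 16 * 1024 * 1024   # default WAL segment size

    def install_segment_recycled(old_path, new_path):
        # reuse an old segment: just a rename, but later overwrites of
        # its blocks make a COW filesystem fault in the old file's
        # metadata (block pointers etc.)
        os.rename(old_path, new_path)

    def install_segment_fresh(new_path):
        # brand new segment: nothing old to consult, but the zero-fill
        # costs extra CPU and write bandwidth up front
        with open(new_path, "wb") as f:
            f.write(b"\0" * SEG_SIZE)
            f.flush()
            os.fsync(f.fileno())

Which side of that dominates is presumably what differs between the
Optane and the SATA setups.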

>> On scale 2000, the
>> throughput actually decreased a bit, by about 5% - from the chart it
>> seems disabling the WAL reuse somewhat amplifies impact of checkpoints,
>> for some reason.
>
> Huh.
>

Not sure what's causing this. It's not visible in the SATA results,
though.

>> I have no idea what happened at the largest scale (8000) - on master
>> there's a huge drop after ~120 minutes, which somewhat recovers at ~220
>> minutes (but not fully). Without WAL reuse there's no such drop,
>> although there seems to be some degradation after ~220 minutes (i.e. at
>> about the same time the master partially recovers. I'm not sure what to
>> think about this, I wonder if it might be caused by almost filling the
>> disk space, or something like that. I'm rerunning this with scale 600.
>
> There are lots of reports of ZFS performance degrading when free space
> gets below something like 20%.
>

I've repeated the benchmarks on the Optane SSD with the largest scale
reduced to 6000, to see if that prevents the performance drop seen with
less than 20% of free space. It apparently does (see zfs2.pdf), although
it does not change the overall behavior - with WAL reuse disabled it's
still a bit slower.
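
For anyone reproducing this, a simple way to guard against the
free-space issue in a benchmark driver might look like this (the pool
name "tank" is a placeholder):

    import subprocess

    def pool_capacity_pct(pool="tank"):
        out = subprocess.check_output(
            ["zpool", "list", "-H", "-o", "capacity", pool], text=True)
        return int(out.strip().rstrip("%"))

    # refuse to run if the pool is more than ~80% full, i.e. less than
    # 20% free, where ZFS performance reportedly degrades
    if pool_capacity_pct() > 80:
        raise SystemExit("pool is >80% full, results would be skewed")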

I've also done the tests with SATA devices (3x 7.2k drives), to see if
the I/O vs. CPU trade-off changes the behavior. And it seems to, to some
extent (see zfs-sata.pdf). For the smallest scale (200) there's not much
difference. For the medium scale (2000) there seems to be a clear
improvement, although the behavior is not particularly smooth. On the
largest scale (8000) there seems to be a slight improvement, or at least
it's not slower like before.
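
For reference, the overall shape of the test matrix as a sketch - the
scales are the ones above, but the client count, duration and the way
the WAL-reuse setting gets toggled are placeholders, since I'm not tying
this to the patch's final GUC name:

    import subprocess

    for scale in (200, 2000, 8000):
        subprocess.run(["pgbench", "-i", "-s", str(scale), "bench"],
                       check=True)
        for wal_reuse in ("on", "off"):
            # ... set the patch's WAL-reuse GUC to wal_reuse, restart ...
            subprocess.run(["pgbench", "-c", "32", "-j", "32",
                            "-T", "3600", "-P", "60", "bench"],
                           check=True)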

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
zfs-sata.pdf application/pdf 27.5 KB
zfs2.pdf application/pdf 27.3 KB
