From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2020-06-05 15:20:52
Message-ID: 20200605152052.pe6ql7nxzrpbrit7@development
Lists: pgsql-hackers
Hi,
I've spent some time testing this, mostly from the performance point of
view. I've done a very simple thing, in order to have a reproducible test
(a rough sketch of the commands follows the list):
1) I've initialized pgbench with scale 8000 (so ~120GB on a machine with
only 64GB of RAM)
2) created a physical backup, enabled WAL archiving
3) did 1h pgbench run with 32 clients
4) disabled full-page writes and did another 1h pgbench run
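Roughly, those steps translate to something like this - just a sketch with
placeholder paths and a simplified archive_command, not the exact (and
admittedly hacky) commands I used:

    # init pgbench, ~120GB of data
    pgbench -i -s 8000 postgres

    # enable WAL archiving (archive_mode needs a restart), take a base backup
    psql -c "ALTER SYSTEM SET archive_mode = on"
    psql -c "ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal-archive/%f'"
    pg_ctl -D $PGDATA restart
    pg_basebackup -D /mnt/backup -c fast

    # 1h run with full_page_writes=on
    pgbench -c 32 -j 32 -T 3600 postgres

    # disable FPW (a reload is enough) and do another 1h run
    psql -c "ALTER SYSTEM SET full_page_writes = off"
    psql -c "SELECT pg_reload_conf()"
    pgbench -c 32 -j 32 -T 3600 postgres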
Once I had this, I did a recovery using the physical backup and the WAL
archive, measuring how long it took to apply each WAL segment. I did that
first without any prefetching (current master), and then twice with
prefetching - first with the default values (m_io_c=10, distance=256kB)
and then with higher values (100 + 2MB). The recovery configuration is
sketched below.
I did this on the two storage systems in this machine - NVME SSD and
SATA RAID (3 x 7.2k drives). So, a fast one and a slow one.
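For reference, the recovery side looked roughly like this - a sketch, not
verbatim; the restore script name is made up, maintenance_io_concurrency
is the stock GUC abbreviated as m_io_c above, and the prefetch distance is
whatever GUC the patch provides for it:

    # postgresql.conf excerpt for the "prefetch2" run ("prefetch" used the
    # defaults, i.e. maintenance_io_concurrency = 10 and 256kB distance)
    maintenance_io_concurrency = 100
    # ... plus the patch's prefetch-distance GUC raised from 256kB to 2MB

    # restore_command wrapped in a script that timestamps every restored
    # segment, so per-segment apply time can be computed from the deltas
    restore_command = '/usr/local/bin/timed-restore.sh %f %p'

where the (made-up) timed-restore.sh is just:

    #!/bin/bash
    echo "$(date +%s.%N) $1" >> /var/tmp/restore-timing.log
    exec cp /mnt/wal-archive/"$1" "$2"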
1) NVME
On the NVME, this generates ~26k WAL segments (~400GB), and each of the
pgbench runs generates ~120M transactions (~33k tps). Of course, the vast
majority of the WAL segments (~16k) comes from the first run, because
there's a lot of FPIs due to the random nature of the workload.
I did not expect a significant improvement from the prefetching, as
the NVME is pretty good at handling random I/O. The total duration looks
like this:
   no prefetch     prefetch    prefetch2
         10618        10385         9403
So the default is a tiny bit faster, and the more aggressive config
makes it about 10% faster. Not bad, considering the expectations.
Attached is a chart comparing the three runs. There are three clearly
visible parts - first the 1h run with f_p_w=on, with two checkpoints.
That's the first ~16k segments. Then there's a bit of a gap before the
second pgbench run was started - I think it's mostly autovacuum etc. And
then at segment ~23k the second pgbench (f_p_w=off) starts.
I think this shows the prefetching starts to help as the number of FPIs
decreases. It's subtle, but it's there.
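One way to see that shift is pg_waldump's --stats output for segments from
the two parts of the archive (the segment names below are placeholders,
pick any segment from each run):

    pg_waldump --stats /mnt/wal-archive/<segment from the FPW=on run>
    pg_waldump --stats /mnt/wal-archive/<segment from the FPW=off run>

    # the "FPI size" column dominates in the first case and mostly
    # disappears in the second - and it's the non-FPI segments where
    # prefetching has a chance to help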
2) SATA
On SATA it's just ~550 segments (~8.5GB), and the pgbench runs generate
only about 1M transactions. Again, the vast majority of the segments comes
from the first run, due to FPIs.
In this case, I don't have complete results, but after processing 542
segments (out of the ~550) it looks like this:
   no prefetch     prefetch    prefetch2
          6644         6635         8282
So the no prefetch and "default" prefetch runs are roughly on par, but the
"aggressive" prefetch is way slower. I'll get back to this shortly, but
I'd like to point out this is entirely due to the "no FPI" pgbench run,
because after the initial ~525 segments it looks like this:
   no prefetch     prefetch    prefetch2
            58           65           57
So it goes through the initial segments (with plenty of FPIs) very fast,
and then we get to the "no FPI" segments, where the prefetch either does
not help or makes things slower.
Looking at how long it takes to apply the last few segments, I get this:
   no prefetch     prefetch    prefetch2
           280          298          478
which is not particularly great, I guess. However, there seems to be
something wrong, because with the prefetching I see this in the log:
prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
0000000100000108000000FF, offset 0
prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
000000010000010900000001, offset 0
Which seems pretty suspicious, but I have no idea what's wrong. I admit
the archive/restore commands are a bit hacky, but I've only seen this
with prefetching on the SATA storage, while all other cases seem to be
just fine. I haven't seen it on NVME (which processes much more WAL).
And the SATA baseline (no prefetching) also worked fine.
Moreover, the pageaddr value is the same in both cases, but the WAL
segments are different (though just one segment apart). Seems strange.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
  image/png, 29.3 KB
  image/png, 20.4 KB
  image/png, 22.7 KB