From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: Eager page freeze criteria clarification
Date: 2023-12-09 04:11:54
Message-ID: CAAKRu_b3tpbdRPUPh1Q5h35gXhY=spH2ssNsEsJ9sDfw6=PEAg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 8, 2023 at 9:23 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> The next step is to devise different heuristics and measure their
> efficacy. IMO, the goal of the algorithm is to freeze pages in a
> relation such that we drive early unfreezes/freezes -> 0 and pages
> frozen/number of pages of a certain age -> 1.
Attached is a patchset with an adaptive freezing algorithm that works
well. It keeps track of the pages' unmodified duration after having been
set all-visible and uses that to predict how likely a page is to be
modified given its age.
Each time an all-visible page is modified by an update, delete, or tuple
lock, if the page was all-visible for less than target_freeze_duration
(a new GUC that specifies the minimum amount of time you would like data
to stay frozen), it is counted as an "early unset", and the duration for
which it was unmodified is added to an accumulator. Before each relation is
vacuumed, we calculate the mean and standard deviation of these
durations using the accumulator. This is used to calculate a page LSN
threshold demarcating pages with a < 5% likelihood of being modified
before target_freeze_duration. We don't freeze pages younger than this
threshold.
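
To make the cutoff concrete, here is a minimal standalone sketch of the
kind of test this implies (my own illustration, not the patch's code;
the 1.645 multiplier, which is the one-sided 95% point of a normal
distribution, and the exact units are my assumptions). Page ages are
treated as LSN deltas, using the mean and standard deviation maintained
by the accumulator described below.

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a page's age puts it past the
 * "< 5% chance of an early unset" cutoff.  mean and stddev describe
 * the observed ages (in LSN units) at which pages were modified before
 * target_freeze_duration.  Under a normal model, only ~5% of early
 * unsets happen later than mean + 1.645 * stddev, so a page already
 * older than that is considered unlikely to be modified early.
 */
static bool
page_past_early_unset_cutoff(uint64_t page_age_lsns,
                             double mean, double stddev)
{
    double cutoff = mean + 1.645 * stddev;

    return (double) page_age_lsns >= cutoff;
}
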
I've done away with the ring buffer of vacuums and the idea of
attributing an "unset" to the vacuum that set it all visible. Instead,
using an "accumulator", I keep a running sum of the page ages, along
with the cardinality of the accumulated set and the sum squared of page
ages. This data structure allows us to extract the mean and standard
deviation, at any time, from an arbitrary number of values in constant
space; and is used to model the pattern of these unsets as a normal
distribution that we can use to try and predict whether or not a page
will be modified.
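
For concreteness, a minimal standalone version of such an accumulator
(my own sketch, not the code in the attached 0004 patch; the names and
types are invented) could look like this:

#include <math.h>
#include <stdint.h>

/* Running aggregates from which mean and stddev can be derived. */
typedef struct AgeAccumulator
{
    double   sum;       /* sum of observed page ages */
    double   sum_sq;    /* sum of squared page ages */
    uint64_t n;         /* number of ages observed */
} AgeAccumulator;

static void
accum_insert(AgeAccumulator *acc, double age)
{
    acc->sum += age;
    acc->sum_sq += age * age;
    acc->n++;
}

static double
accum_mean(const AgeAccumulator *acc)
{
    return acc->n > 0 ? acc->sum / acc->n : 0.0;
}

static double
accum_stddev(const AgeAccumulator *acc)
{
    double mean;
    double var;

    if (acc->n < 2)
        return 0.0;
    mean = acc->sum / acc->n;
    /* population variance: E[x^2] - (E[x])^2 */
    var = acc->sum_sq / acc->n - mean * mean;
    return var > 0.0 ? sqrt(var) : 0.0;
}

accum_insert() would be called with the page's age each time an early
unset is counted; accum_mean() and accum_stddev() feed the threshold
calculation sketched above.
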
Values can be "removed" from the accumulator by simply decrementing its
cardinality and decreasing the sum and sum squared by a value that will
not change the mean and standard deviation of the overall distribution.
To adapt to a table's changing access patterns, we'll need to remove
values from this accumulator over time, but this patch doesn't yet
decide when to do this. A simple solution may be to cap the cardinality
of the accumulator to the greater of 1% of the table size, or some fixed
number of values (perhaps 200?). Even without such removal of values,
the distribution recorded in the accumulator will eventually skew toward
more recent data, albeit at a slower rate.
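
Continuing the sketch above, "removing" a value so that the remaining
mean and standard deviation are untouched just means scaling both sums
down by one n-th before decrementing the count:

/*
 * Forget one value without changing the mean or standard deviation:
 * subtract the per-value share of each running sum (sum/n and sum_sq/n)
 * and decrement the count.  sum/n and sum_sq/n are unchanged afterward,
 * so the derived mean and variance are too.
 */
static void
accum_forget_one(AgeAccumulator *acc)
{
    if (acc->n == 0)
        return;
    acc->sum -= acc->sum / acc->n;
    acc->sum_sq -= acc->sum_sq / acc->n;
    acc->n--;
}

Dropping k values at once is equivalent to scaling both sums by
(n - k) / n.
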
This algorithm is able to predict when pages will be modified before
target_freeze_duration as long as their modification pattern fits a
normal distribution -- this accommodates access patterns where, after
some period of time, pages become less likely to be modified the older
they are. As an example, however, access patterns where modification
times are bimodal aren't handled well by this model (imagine that half
your modifications are to very new data and the other half are to data
that is much older but still younger than target_freeze_duration). If
this is a common access pattern, the normal distribution could easily be
swapped out for another distribution. The current accumulator is capable
of tracking a distribution's first two moments (the mean and standard
deviation). We could track more if we wanted to use a
fancier distribution.
We also must consider what to do when we have few unsets, e.g. with an
insert-only workload. When there are very few unsets (I chose 30, which
the internet says is the approximate minimum number of samples required
for the central limit theorem to hold), we can choose to always freeze
freezable pages; above this limit, the calculated page threshold is
used. However,
we may not want to make predictions based on 31 values either. To avoid
this "cliff", we could modify the algorithm a bit to skew the mean and
standard deviation of the distribution using a confidence interval based
on the number of values we've collected.
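
As a rough sketch of the simple version of that rule, before any
confidence-interval smoothing (again my own illustration, not the
patch's code):

#include <stdbool.h>
#include <stdint.h>

#define MIN_EARLY_UNSETS_FOR_MODEL 30   /* too few samples to trust the model */

/*
 * With fewer than 30 observed early unsets, always freeze freezable
 * pages; otherwise defer to the age cutoff derived from the accumulator
 * (same cutoff as in the earlier sketch).  Skewing mean and stddev by a
 * sample-size-based confidence interval would smooth the jump at
 * exactly 30 values.
 */
static bool
should_freeze_page(uint64_t n_early_unsets, uint64_t page_age_lsns,
                   double mean, double stddev)
{
    if (n_early_unsets < MIN_EARLY_UNSETS_FOR_MODEL)
        return true;

    return (double) page_age_lsns >= mean + 1.645 * stddev;
}
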
The goal is to keep pages frozen for at least target_freeze_duration.
target_freeze_duration is in seconds and pages only have a last
modification LSN, so target_freeze_duration must be converted to LSNs.
To accomplish this, I've added an LSNTimeline data structure, containing
(XLogRecPtr, TimestampTz) pairs stored with decreasing precision as they
age. When we need to translate the GUC value to LSNs, we linearly
interpolate it on this timeline. For the time being, the global
LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There
is no reason it can't be updated with some other cadence and/or by some
other process (nothing about it is inherently tied to vacuum). The
cached translated value of target_freeze_duration is stored in each
table's stats. This is arbitrary as it is not a table-level stat.
However, it needs to be located somewhere that is accessible on
update/delete. We may want to recalculate it more often than once per
table vacuum, especially in case of long-running vacuums.
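
To illustrate the translation itself (only a sketch of the idea; the
layout in the attached 0002 patch differs, and plain integer typedefs
stand in for XLogRecPtr and TimestampTz here), linear interpolation
over recorded (LSN, time) points might look like:

#include <stdint.h>

/* Stand-ins for XLogRecPtr (bytes of WAL) and TimestampTz (microseconds). */
typedef uint64_t lsn_t;
typedef int64_t  timestamp_us_t;

typedef struct LSNTimePoint
{
    timestamp_us_t time;    /* when this point was recorded */
    lsn_t          lsn;     /* insert LSN at that time */
} LSNTimePoint;

/*
 * Estimate the insert LSN at target_time by linearly interpolating
 * between the two recorded points that bracket it.  Points are assumed
 * to be sorted by time, oldest first; out-of-range times clamp to the
 * nearest endpoint.
 */
static lsn_t
estimate_lsn_at_time(const LSNTimePoint *points, int npoints,
                     timestamp_us_t target_time)
{
    int i;

    if (npoints == 0)
        return 0;
    if (target_time <= points[0].time)
        return points[0].lsn;
    if (target_time >= points[npoints - 1].time)
        return points[npoints - 1].lsn;

    for (i = 1; i < npoints; i++)
    {
        if (target_time <= points[i].time)
        {
            const LSNTimePoint *lo = &points[i - 1];
            const LSNTimePoint *hi = &points[i];
            double frac = (double) (target_time - lo->time) /
                          (double) (hi->time - lo->time);

            return lo->lsn + (lsn_t) (frac * (double) (hi->lsn - lo->lsn));
        }
    }
    return points[npoints - 1].lsn;     /* not reached */
}

With something like this, "current time minus target_freeze_duration"
can be turned into an approximate LSN cutoff to compare against page
LSNs.
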
To benchmark this new heuristic (I'm calling it algo 6 since it is the
6th I've proposed on this thread), I've used a modified subset of my
original workloads:
Workloads
C. Shifting hot set
32 clients inserting multiple rows and then updating an indexed
column of a row on a different page containing data they formerly
inserted. Only recent data is updated.
H. Append-only table
32 clients, each inserting a single row at a time
I. Work queue
32 clients, each inserting a row, then updating an indexed column in
2 different rows in nearby pages, then deleting 1 of the updated rows
Workload C:
Algo | Table | AVs | Page Freezes | Pages Frozen | % Frozen
-----|----------|-----|--------------|--------------|---------
6 | hot_tail | 14 | 2,111,824 | 1,643,217 | 53.4%
M | hot_tail | 16 | 241,605 | 3,921 | 0.1%
Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO time | TPS
-----|--------|------------------|------------|------------|--------
6 | 193 | 5,473,949 | 12,793,574 | 14,870 | 28,397
M | 207 | 5,213,763 | 20,362,315 | 46,190 | 28,461
Algorithm 6 freezes all of the cold data and doesn't freeze the current
working set. The notable thing is how much this reduces overall system
I/O. On master, autovacuum is doing more than 3x the I/O and the rest of
the system is doing more than 1.5x the I/O. I suspect freezing data when
it is initially vacuumed is saving future vacuums from having to evict
pages of the working set and read in cold data.
Workload H:
Algo | Table | AVs | Page Freezes | Pages Frozen | % Frozen
-----|-----------|-----|--------------|--------------|---------
6 | hthistory | 22 | 668,448 | 668,003 | 87%
M | hthistory | 22 | 0 | 0 | 0%
Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO time | TPS
-----|--------|------------------|-----------|------------|--------
6 | 14 | 725,103 | 725,575 | 1 | 43,535
M | 13 | 693,945 | 694,417 | 1 | 43,522
The insert-only table is mostly frozen at the end. There is more I/O
done but not at the expense of TPS. This is exactly what we want.
Workload I:
Algo | Table | AVs | Page Freezes | Pages Frozen | % Frozen
-----|-----------|-----|--------------|--------------|---------
6 | workqueue | 234 | 0 | 4,416 | 78%
M | workqueue | 234 | 0 | 4,799 | 87%
Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO Time | TPS
-----|--------|------------------|-----------|------------|--------
6 | 74 | 64,345 | 64,813 | 1 | 36,366
M | 73 | 63,468 | 63,936 | 1 | 36,145
What we want is for the work queue table to freeze as little as
possible, because we will be constantly modifying the data. Both on
master and with algorithm 6 we do not freeze tuples on any pages. You
will notice, however, that much of the table is set all-frozen in the VM
at the end. This is because we set pages all-frozen in the VM if they
already happen to be all frozen, even when we do not freeze any tuples
on the page. This is inexpensive and not under the control of the freeze
heuristic.
Overall, the behavior of this new adaptive freezing method seems to be
exactly what we want. The next step is to decide how many values to
remove from the accumulator and benchmark cases where old data is
deleted.
I'd be delighted to receive any feedback, ideas, questions, or review.
- Melanie
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Record-LSN-at-postmaster-startup.patch | text/x-patch | 2.3 KB |
v2-0005-Add-guc-target_freeze_duration.patch | text/x-patch | 2.8 KB |
v2-0003-visibilitymap_set-clear-return-previous-vm-bits.patch | text/x-patch | 6.6 KB |
v2-0002-WIP-Add-LSNTimeline-to-estimate-LSN-consumption-r.patch | text/x-patch | 6.2 KB |
v2-0004-Add-accumulator-to-calculate-normal-dist-online.patch | text/x-patch | 2.6 KB |
v2-0006-Count-table-modification-VM-clears.patch | text/x-patch | 19.5 KB |
v2-0007-Opportunistically-freeze-pages-unlikely-to-be-mod.patch | text/x-patch | 12.6 KB |
v2-0008-Track-VM-sets-by-vacuum.patch | text/x-patch | 8.6 KB |
v2-0010-Add-pg_visibility_map_summary_extended.patch | text/x-patch | 7.2 KB |
v2-0009-Add-VM-set-and-unset-stats-to-pg_stat_all_tables.patch | text/x-patch | 11.9 KB |