Re: Clock sweep not caching enough B-Tree leaf pages?

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Clock sweep not caching enough B-Tree leaf pages?
Date: 2014-04-22 06:57:04
Message-ID: CAM3SWZTCNobsNvEyiipT1u2xWrOOt-TMRzqyPV=s-x6MG46KJA@mail.gmail.com
Lists: pgsql-hackers

Here is a benchmark that is similar to my earlier one, but with a rate
limit of 125 tps, to help us better characterize how the prototype
patch helps performance:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/

Again, these are 15-minute runs with unlogged tables at multiple
client counts, at scale factor 5,000.

Every test run should have managed to hit that limit, but one master
test did not. I have included vmstat, iostat and meminfo OS
instrumentation this time around, which is really interesting for
this particular limit-based benchmark. The prototype patch tested
here is a slight refinement of my earlier prototype. Apart from
reducing the number of gettimeofday() calls and moving them out of
the critical path, I increased the initial usage_count value to 6. I
also set BM_MAX_USAGE_COUNT to 30. I couldn't resist the temptation
to tweak things, something I did very little of before publishing my
initial results. The tweaking helps, but there isn't a huge
additional benefit.
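
To make those two tweaks concrete, here is a minimal, self-contained
toy model of the clock sweep with the raised starting value and cap.
This is a sketch of the mechanism only, not the prototype patch
itself; the real logic lives in freelist.c and bufmgr.c, interleaved
with locking and pin management that this leaves out:

/*
 * Toy model of the clock sweep, illustrating the two constants the
 * prototype changes: buffers now start out at usage_count 6 rather
 * than 1, and the cap is raised from 5 to 30. A sketch only.
 */
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS            16
#define INITIAL_USAGE_COUNT 6   /* prototype tweak; stock value is 1 */
#define BM_MAX_USAGE_COUNT  30  /* prototype tweak; stock value is 5 */

typedef struct
{
    int  usage_count;
    bool pinned;
} BufferDesc;

static BufferDesc buffers[NBUFFERS];
static int next_victim = 0;

/* A newly allocated page enters with a head start. */
static void
buffer_allocated(int buf)
{
    buffers[buf].usage_count = INITIAL_USAGE_COUNT;
}

/* Each access credits frequency, up to the raised cap. */
static void
buffer_touched(int buf)
{
    if (buffers[buf].usage_count < BM_MAX_USAGE_COUNT)
        buffers[buf].usage_count++;
}

/* Sweep: decrement counts until an unpinned, zero-count victim turns up. */
static int
clock_sweep(void)
{
    for (;;)
    {
        BufferDesc *buf = &buffers[next_victim];
        int         victim = next_victim;

        next_victim = (next_victim + 1) % NBUFFERS;
        if (buf->pinned)
            continue;
        if (buf->usage_count > 0)
            buf->usage_count--;
        else
            return victim;
    }
}

int
main(void)
{
    int i;

    for (i = 0; i < NBUFFERS; i++)
        buffer_allocated(i);

    /* Buffer 3 is hot; repeated access drives its count up to the cap. */
    for (i = 0; i < 40; i++)
        buffer_touched(3);

    /* The hot buffer survives eviction rounds that claim its neighbors. */
    for (i = 0; i < 5; i++)
    {
        int victim = clock_sweep();

        printf("evicted buffer %d\n", victim);
        buffer_allocated(victim);   /* new page enters at usage_count 6 */
    }
    return 0;
}

In this toy run, buffer 3 keeps getting skipped while its neighbors
are evicted, which is exactly the frequency-crediting behavior the
patch aims for.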

The benchmark results show that master cannot even meet the 125 tps
limit in 1 test out of 9. More interestingly, the background writer
consistently cleans about 10,000 buffers per test run when testing
the patch, while buffers are allocated at a very consistent rate of
around 262,000 per patched test run. Leaving aside the first test
run, the patch shows only a tiny variance in the number cleaned from
test to test, just a few hundred buffers. In contrast, master has
enormous variance. During just over half of the tests, the
background writer does not clean even a single buffer. Then, in 2
tests out of 9, it cleans an enormous ~350,000 buffers. The second
time this happens, master fails to even meet the 125 tps limit
(albeit with only one client).

If you drill down to individual test runs, a similar pattern is
evident. You'll now find operating system information (meminfo dirty
memory) graphed there as well. Most of the time, master does not
hold more than 1,000 kB of dirty memory; once or twice it sits at 0
kB for multiple minutes. However, the test runs with huge spikes in
background-writer-cleaned pages also show huge spikes in the total
amount of dirty memory (and correlated huge spikes in latency),
reaching highs of ~700,000 kB at one point. In contrast, the patched
tests show very consistent amounts of dirty memory, almost always
topping out at 4,000 kB - 6,000 kB per test (there is a single
12,000 kB spike, though). The dirty memory graph shows a
consistently distinct zig-zag pattern for the patched tests,
regardless of client count or where checkpoints occur. Master shows
mountains and valleys for the two particularly problematic tests,
correlating with a panicked background writer's aggressive feedback
loop. Master also shows less rhythmic zig-zag patterns that peak at
only about 600 kB - 1,000 kB for the entire duration of many
individual test runs.

Perhaps most notably, average and worst case latency are far
improved with the patch. Average latency is less than half that of
master with 1 client, and less than a quarter with 32 clients.

I think that the rate limiting feature of pgbench is really useful
for characterizing how work like this improves performance. I see a
far smoother and more consistent pattern of I/O that superficially
looks like Postgres is cooperating with the operating system much
more than it does in the baseline. It sort of makes sense that the
operating system cache doesn't care about frequency while Postgres
does. If the OS cache did weigh frequency, it would surely not alter
the outcome very much, since OS-cached data has presumably not been
accessed very frequently recently. I suspect that I've cut down on
double buffering by quite a bit. I would like to come up with a
simple way of measuring that, using something like pgfincore, but
the available interfaces don't seem well-suited to quantifying how
much of a problem double buffering is, and how much of it remains
with the patch. Calling pgfincore on the index segment files might
be interesting, since shared_buffers mostly holds index pages (as
verified using pg_buffercache).
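
A simple way to get at that measurement might be mincore(2), which
is what pgfincore itself builds on. Here is a rough, Linux-specific
sketch that reports how much of a given relation segment file (some
base/<dboid>/<relfilenode> file of your choosing) is resident in the
OS page cache:

/*
 * residency.c: report what fraction of a file is in the OS page cache.
 * A pgfincore-style sketch using mmap() + mincore(); Linux-specific.
 * Build: cc -o residency residency.c
 * Usage: ./residency <path-to-relation-segment-file>
 */
#define _DEFAULT_SOURCE         /* for mincore() on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int         fd;
    struct stat st;
    void       *map;
    long        pagesize;
    size_t      npages, resident = 0, i;
    unsigned char *vec;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <segment-file>\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0)
    {
        perror(argv[1]);
        return 1;
    }
    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    pagesize = sysconf(_SC_PAGESIZE);
    npages = (st.st_size + pagesize - 1) / pagesize;
    vec = malloc(npages);
    if (mincore(map, st.st_size, vec) < 0)
    {
        perror("mincore");
        return 1;
    }
    for (i = 0; i < npages; i++)
        if (vec[i] & 1)         /* low bit set: page is resident */
            resident++;
    printf("%zu of %zu OS pages resident (%.1f%%)\n",
           resident, npages, 100.0 * resident / (double) npages);

    munmap(map, st.st_size);
    free(vec);
    close(fd);
    return 0;
}

Comparing the resident fraction of the index segments against
pg_buffercache's picture of shared_buffers, before and after the
patch, would give at least a crude number for double buffering.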

It would be great to test this out with something involving
non-uniform distributions, like Gaussian and Zipfian distributions.
The LRU-K paper tests Zipfian too. The uniform distribution pgbench
uses here, while interesting, doesn't tell the full story at all and
is less representative of reality (TPC-C is formally required to use
a non-uniform distribution [1] for some things, for example). A
Gaussian distribution might show essentially the same failure to
properly credit pages with frequency of access, in one additional
dimension, so to speak. Someone should probably look at the
TPC-C-like DBT-2, since in the past that was considered a
problematic workload for PostgreSQL [2] due to its heavy I/O.
Zipfian is a lot less sympathetic than uniform if the LRU-K paper is
any indication, so if we're looking for a worst case, that's
probably a good place to start. It's not obvious how you'd go about
actually constructing a practical test case for either, though.
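
That said, the key generation itself is not the hard part. Here is a
sketch of straightforward inverse-CDF Zipfian sampling, which a
driver program or custom script could use in place of pgbench's
uniform random account IDs; the key-space size and exponent below
are illustrative assumptions, not values taken from these tests:

/*
 * zipf.c: draw keys from a Zipfian distribution via inverse-CDF
 * sampling. Illustrative sketch. Build: cc -o zipf zipf.c -lm
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Precompute the CDF for ranks 1..n with exponent s. O(n) setup. */
static double *
zipf_cdf(int n, double s)
{
    double *cdf = malloc(n * sizeof(double));
    double  norm = 0.0, running = 0.0;
    int     i;

    for (i = 1; i <= n; i++)
        norm += 1.0 / pow(i, s);
    for (i = 1; i <= n; i++)
    {
        running += (1.0 / pow(i, s)) / norm;
        cdf[i - 1] = running;
    }
    return cdf;
}

/* Binary-search the CDF; returns a rank in 1..n (rank 1 is hottest). */
static int
zipf_draw(const double *cdf, int n)
{
    double  u = rand() / ((double) RAND_MAX + 1.0);
    int     lo = 0, hi = n - 1;

    while (lo < hi)
    {
        int mid = (lo + hi) / 2;

        if (cdf[mid] < u)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo + 1;
}

int
main(void)
{
    int     n = 1000000;    /* key space; think pgbench_accounts aids */
    double  s = 1.0;        /* classic Zipf exponent */
    double *cdf = zipf_cdf(n, s);
    int     i;

    for (i = 0; i < 10; i++)
        printf("%d\n", zipf_draw(cdf, n));
    free(cdf);
    return 0;
}

Mapping draws like these onto UPDATEs against pgbench_accounts would
give a skewed variant of the same workload; a Gaussian variant just
needs a different CDF.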

I should acknowledge that I have clearly regressed one aspect, as is
evident from the benchmark: cleanup time (which is recorded per
test) takes more than twice as long. This makes intuitive sense,
though. VACUUM first creates a list of tuples to kill by scanning
the heap. It then visits the indexes, killing the entries for those
tuples there first, before returning to the heap and killing the
heap tuples themselves in a final scan. Clearly it is bad for VACUUM
that shared_buffers mostly contains index pages when it begins. That
said, I haven't actually considered the interactions with buffer
access strategies here. It might well be more complicated than I've
suggested.
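
In outline, the three passes look something like this sketch (purely
illustrative stand-ins, not vacuumlazy.c's actual types or
functions); the relevant point is that the full index pass sits in
between the two heap passes:

#include <stdio.h>

#define MAX_DEAD 1024

typedef struct
{
    int nitems;
    int tids[MAX_DEAD];     /* stand-in for an ItemPointerData array */
} DeadTuples;

/* Pass 1: scan the heap, remembering TIDs of dead tuples. */
static void
scan_heap(DeadTuples *dead)
{
    int tid;

    dead->nitems = 0;
    for (tid = 0; tid < 100; tid++)
        if (tid % 10 == 0)              /* pretend every 10th is dead */
            dead->tids[dead->nitems++] = tid;
    printf("pass 1: heap scan found %d dead tuples\n", dead->nitems);
}

/* Pass 2: scan each index in full, deleting entries for dead TIDs.
 * With shared_buffers mostly full of index pages, this is where the
 * cache gets churned. */
static void
vacuum_index(const char *indexname, const DeadTuples *dead)
{
    printf("pass 2: removed %d entries from %s\n", dead->nitems, indexname);
}

/* Pass 3: return to the heap and mark dead line pointers unused. */
static void
vacuum_heap(const DeadTuples *dead)
{
    printf("pass 3: reclaimed %d heap line pointers\n", dead->nitems);
}

int
main(void)
{
    const char *indexes[] = { "pgbench_accounts_pkey" };
    DeadTuples  dead;
    int         i;

    scan_heap(&dead);
    for (i = 0; i < 1; i++)
        vacuum_index(indexes[i], &dead);
    vacuum_heap(&dead);
    return 0;
}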

To be thorough, I've repeated each test set. There is a "do over" for
both master and patched, which serves to show how repeatable the
original test sets are.

[1] Clause 2.1.6, TPC-C specification:
http://www.tpc.org/tpcc/spec/tpcc_current.pdf

[2] https://wiki.postgresql.org/wiki/PgCon_2011_Developer_Meeting#DBT-2_I.2FO_Performance
--
Peter Geoghegan
