Re: Performance Improvement by reducing WAL for Update Operation

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Hari Babu <haribabu(dot)kommi(at)huawei(dot)com>, Mike Blackwell <mike(dot)blackwell(at)rrd(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2013-11-26 02:55:58
Message-ID: CA+Tgmoa-b7oHNMyngY=0EoTAXbDZ+JYd437cBqnUFTLuZqYCQg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 22, 2013 at 2:31 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On the Mac, the only case that seems to have a slowdown now is "hundred tiny
> fields, half nulled". It would be nice to understand just what is going on
> with that one. I got some ugly results in "two short fields, no change"
> too, along with a couple of other weird results, but I think those were
> testing procedure issues that can be ignored. The pgbench throttle work I
> did recently highlights that I can't really make a Mac quiet/consistent for
> benchmarking very well. Note that I ran all of the Mac tests with
> assertions on, to try and catch platform specific bugs. The Linux ones used
> the default build parameters.

Amit has been asking me to look at this patch for a while, so I
finally did. While I agree that it would be nice to get the CPU
overhead down low enough that we can turn this on by default and
forget about it, I'm not convinced that it's without value even if we
can't. Fundamentally, this patch trades away some CPU in exchange
for decreased I/O. The testing thus far is all about whether the CPU
overhead can be made trivial, which is a good question to ask, because
if the answer is yes, then rather than trading something for something
else, we just get something for free. Win! But even if that doesn't
pan out, I think the fallback position should not be "OK, well, if we
can't get decreased I/O for free then forget it" but rather "OK, if we
can't get decreased I/O for free then let's get decreased I/O in
exchange for increased CPU usage".

I spent a little time running the tests from Heikki's script under
perf. On all three "two short fields" tests and also on the "ten tiny
fields, all changed" test, we spend about 1% of the CPU time in
pglz_delta_encode. I don't see any evidence that it's actually
compressing anything at all; it appears to be falling out where we
test the input length against the strategy, presumably because the
default strategy (which we largely copy here) doesn't try to compress
input data of less than 32 bytes. Given that this code isn't
actually compressing anything in these cases, I'm a bit confused by
Greg's report of substantial gains on "ten tiny fields, all changed"
test; how can we win if we're not compressing?
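
To make that concrete, here is a minimal sketch, not the patch's
actual code, of the kind of strategy test these inputs are falling
out of; the struct and function names are invented for illustration,
with the threshold modeled on pglz's default of refusing inputs under
32 bytes:

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: with strategy limits copied from pglz's defaults,
 * short inputs are rejected before any real compression work happens,
 * which would explain ~1% of CPU in pglz_delta_encode with no actual
 * compression taking place.
 */
typedef struct DeltaStrategy
{
    int32_t     min_input_size;     /* don't bother below this length */
    int32_t     max_input_size;     /* ... nor above it */
} DeltaStrategy;

static const DeltaStrategy delta_strategy_default = {32, INT32_MAX};

static bool
delta_encode_worthwhile(int32_t slen, const DeltaStrategy *strategy)
{
    return slen >= strategy->min_input_size &&
           slen <= strategy->max_input_size;
}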

I studied the "hundred tiny fields, half nulled" test case in some
detail. Some thoughts:

- There is a comment "TODO: It would be nice to behave like the
history and the source strings were concatenated, so that you could
compress using the new data, too." If we're not already doing that,
then how are we managing to compress WAL by more than one-quarter in
the "hundred tiny fields, all changed" case? It looks to me like the
patch IS doing that, and I'm not sure it's a good idea, especially
because it's using pglz_hist_add_no_recycle rather than pglz_hist_add:
we verify that hlen + slen < 2 * PGLZ_HISTORY_SIZE but that doesn't
seem good enough. On the "hundred tiny fields, half nulled" test case,
removing that line reduces compression somewhat but also saves on CPU
cycles.
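
Just to spell out what I think the patch is doing there, here's a toy
sketch (identifiers and sizes are illustrative, not the patch's actual
code) of loading the history table from both the old and new tuples
behind that combined-length guard:

#include <stdbool.h>

#define HIST_SIZE   4096            /* stand-in for PGLZ_HISTORY_SIZE */

static int  hist_entries;           /* toy stand-in for the hash table */

static void
hist_add(const char *p)
{
    (void) p;                       /* real code links p into a bucket chain */
    hist_entries++;
}

static bool
load_history(const char *hist, int hlen, const char *src, int slen)
{
    if (hlen + slen >= 2 * HIST_SIZE)
        return false;               /* give up: too much data to index */

    for (int i = 0; i < hlen; i++)
        hist_add(hist + i);         /* index the old tuple */
    for (int i = 0; i < slen; i++)
        hist_add(src + i);          /* ...and the new one too, which is the
                                     * part I'm questioning above */
    return true;
}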

- pglz_find_match() is happy to walk all the way down even a really,
really long bucket chain. It has some code that reduces good_match
each time through, but it fails to produce a non-zero decrement once
good_match * good_drop < 100. So if we're searching an enormously
deep bucket many times in a row, and there are actually no matches,
we'll go flying down the whole linked list every time. I tried
mandating a minimum decrease of 1 and didn't observe that it made any
difference in the run time of this test case, but it still seems odd.
For the record, it's not new behavior with this patch; pglz_compress()
has the same issue as things stand today. I wonder if we ought to
decrease the good match length by a constant rather than a percentage
at each step.
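
Here's a tiny self-contained illustration (the starting values are
made up, not the strategy defaults) of the stall described above:
with integer math the percentage-based decrement rounds to zero once
good_match * good_drop drops below 100, so good_match stops shrinking
and the search stays just as expensive on every pass:

#include <stdio.h>

#define Max(a, b)   ((a) > (b) ? (a) : (b))

int
main(void)
{
    int     good_match = 128;       /* illustrative starting value */
    int     good_drop = 10;         /* illustrative percentage */

    /* current behavior: loop ends once the decrement rounds to zero */
    while ((good_match * good_drop) / 100 > 0)
        good_match -= (good_match * good_drop) / 100;

    printf("good_match stalls at %d and never reaches zero\n", good_match);

    /*
     * The "minimum decrease of 1" experiment mentioned above would be
     * something like:
     *
     *      good_match -= Max(1, (good_match * good_drop) / 100);
     */
    return 0;
}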

- pglz_delta_encode() seems to spend about 50% of its CPU time loading
up the history data. It would be nice to find a way to reduce that
effort. I thought about maybe only making a history entry for, say,
every fourth input position rather than every one, but a quick test
seems to suggest that's a big fail, probably because it's too easy to
skip over the position where we would have made "the right" match via
some short match. But maybe there's some more sophisticated strategy
here that would work better. For example, see:

http://en.wikipedia.org/wiki/Rabin_fingerprint

The basic idea is that you use a rolling hash function to divide up
the history data into chunks of a given average size. So we scan the
history data, compute a rolling hash value at each point, and each
time the bottom N bits are zero, we consider that to be the end of a
chunk. We enter all the chunks into a hash table. The chunk size
will vary, but on the average, given a reasonably well-behaved rolling
hash function (the pglz one probably doesn't qualify) it'll happen
every 2^N bytes, so perhaps for this purpose we'd choose N to be
between 3 and 5. Then, we scan the input we want to compress and
divide it into chunks in the same way. Chunks that don't exist in the
history data get copied to the output, while those that do get
replaced with a reference to their position in the history data.
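
As a toy sketch of that chunking idea (this uses a crude running hash
reset at each cut point, not a true Rabin fingerprint, and all names
here are invented for illustration): whenever the low N bits of the
hash are zero we declare a chunk boundary, so chunks average about
2^N bytes and identical data produces identical boundaries whether it
appears in the history or in the new tuple:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_BITS  4               /* N: average chunk size ~ 2^N bytes */
#define CHUNK_MASK  ((1u << CHUNK_BITS) - 1)

static void
chunk_boundaries(const unsigned char *data, int len)
{
    uint32_t    hash = 0;
    int         chunk_start = 0;

    for (int i = 0; i < len; i++)
    {
        hash = hash * 31 + data[i];     /* cheap hash over the current chunk */

        if ((hash & CHUNK_MASK) == 0 || i == len - 1)
        {
            /*
             * In the real scheme, history chunks would be entered into a
             * hash table here, and matching chunks of the new tuple would
             * be emitted as references rather than literal bytes.
             */
            printf("chunk [%d, %d)\n", chunk_start, i + 1);
            chunk_start = i + 1;
            hash = 0;                   /* boundaries resync after each cut */
        }
    }
}

int
main(void)
{
    const char *history = "hundred tiny fields, half nulled";

    chunk_boundaries((const unsigned char *) history,
                     (int) strlen(history));
    return 0;
}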

I'm not 100% certain that something like this would be better than
trying to leverage the pglz machinery, but I think it might be worth
trying. One possible advantage is that you make many fewer hash-table
entries, which reduces both the cost of setting up the hash table and
the cost of probing it; another is that if you find a hit in the hash
table, you needn't search any further: you are done. That is related
to the point that, for the most part, the processing is
chunk-at-a-time rather than character-at-a-time, which might be more
efficient. On the other hand, the compression ratio might stink, or
it might suck for some other reason: I just don't know.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
