Re: Performance Improvement by reducing WAL for Update Operation

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Gregory <gregoryv(at)zils(dot)esselgroup(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mike Blackwell <mike(dot)blackwell(at)rrd(dot)com>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2014-01-10 15:42:47
Message-ID: CA+TgmoaqdwNSd9KcF01u6Y4c8rSgQ5riu5tPSmq5Ya6mTSpUOA@mail.gmail.com
Lists: pgsql-hackers

> 2. Provide a new reloption to specify WAL compression
> for update operations on a table:
> Create table tbl(c1 char(100)) With (compress_wal = true);
>
> Alternative options:
> a. compress_wal can take the operation as input, e.g. 'insert', 'update',
> b. use alternate syntax:
> Create table tbl(c1 char(100)) Compress Wal For Update;
> c. anything better?

I think WITH (compress_wal = true) is pretty good. I don't understand
your point about taking the operation as input, because this only
applies to updates. But we could try to work "update" into the name
of the setting somehow, so as to be less likely to conflict with
future additions, like maybe wal_compress_update. I think the
alternate syntax you propose is clearly worse, because it would
involve adding new keywords, something we try to avoid.
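
If we go with wal_compress_update, wiring it up should be not much more
than one more entry in the boolRelOpts table in
src/backend/access/common/reloptions.c, something along these lines
(untested sketch; the description string is only a placeholder):

    static relopt_bool boolRelOpts[] =
    {
        /* ... existing entries ... */
        {
            {
                "wal_compress_update",
                "Compresses WAL written for updates to this relation",
                RELOPT_KIND_HEAP
            },
            false               /* off by default */
        },
        /* list terminator */
        {{NULL}}
    };

plus, of course, the matching field in StdRdOptions and an entry in
default_reloptions().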

The only possible enhancement I can think of here is to make the
setting an integer rather than a Boolean, defined as the minimum
acceptable compression ratio. A setting of 0 means always compress; a
setting of 100 means never compress; intermediate values define the
least acceptable ratio. But to be honest, I think that's overkill;
I'd be inclined to hard-code the default value of 25 in the patch and
make it a #define. The only real advantage of requiring a minimum
compression rate of 25% is that we can bail out on compression
three-quarters of the way through the tuple if we're getting nowhere.
That's fine for what it is, but the idea that users are going to see
much benefit from twiddling that number seems very dubious to me.
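
To spell out the hard-coded version I have in mind, something like this
(untested sketch; the names here are mine, not taken from the patch):

    #include <stdbool.h>

    #define PGRB_MIN_COMP_RATE  25      /* required saving, in percent */

    /*
     * Called periodically while encoding.  "emitted" is the number of
     * output bytes produced so far, "slen" the source tuple length.
     */
    static bool
    pgrb_still_worth_it(int emitted, int slen)
    {
        /* the finished output must stay below (100 - rate)% of the source */
        int     result_max = (slen * (100 - PGRB_MIN_COMP_RATE)) / 100;

        /*
         * Once the output alone exceeds that budget, the target can no
         * longer be met, so give up.  With incompressible data this fires
         * roughly three-quarters of the way through the tuple.
         */
        return emitted <= result_max;
    }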

> Points to consider
> -----------------------------
>
> 1. As the current algorithm stores the entry for identical chunks at the head
> of the list, it will always find the last-but-one chunk (we don't store the
> last 4 bytes) for a long matching string during the match phase in encoding
> (pgrb_delta_encode).
>
> We can improve this either by storing identical chunks at the end of the list
> instead of at the head, or by adopting the good_match technique used in the
> LZ algorithm. The good_match technique can add overhead in some cases when
> there is actually no match.

I don't see what the good_match thing has to do with anything in the
Rabin algorithm. But I do think there might be a bug here, which is
that, unless I'm misinterpreting something, hp is NOT the end of the
chunk. After calling pgrb_hash_init(), we've looked at the first FOUR
bytes of the input. If we find that we have a zero hash value at that
point, shouldn't the chunk size be 4, not 1? And similarly if we find
it after sucking in one more byte, shouldn't the chunk size be 5, not
2? Right now, we're deciding where the chunks should end based on the
data in the chunk plus the following 3 bytes, and that seems wonky. I
would expect us to include all of those bytes in the chunk.
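
Concretely, if the rolling hash covers a 4-byte window, I'd expect the
boundary decision to look roughly like this toy version (the hash below
is just a stand-in for the real Rabin fingerprint; none of this is the
patch's code):

    #define PGRB_PATTERN_AFTER_BITS 4
    #define PGRB_HASH_MASK  ((1 << PGRB_PATTERN_AFTER_BITS) - 1)

    /* Return the index of the last byte of the chunk starting at "start". */
    static int
    pgrb_next_chunk_end(const unsigned char *data, int len, int start)
    {
        int     i;

        for (i = start + 3; i < len; i++)
        {
            /* toy hash over the 4-byte window data[i-3 .. i] */
            unsigned int hash = (data[i - 3] << 3) ^ (data[i - 2] << 2) ^
                                (data[i - 1] << 1) ^ data[i];

            if ((hash & PGRB_HASH_MASK) == 0)
                return i;       /* chunk spans start .. i: at least 4 bytes */
        }
        return len - 1;         /* no boundary; chunk runs to end of input */
    }

Defined that way, a zero hash right after pgrb_hash_init() yields a chunk
of 4 bytes rather than 1, and every byte that helped decide the boundary
is inside the chunk.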

> 2. Another optimization we can do in pgrb_find_match() is that currently,
> if the first chunk (the chunk found via the hash index) doesn't match, it
> continues to look for a match in the other chunks. I am not sure there is
> any benefit to searching the other chunks if the first one doesn't match.

Well, if you took that out, I suspect it would hurt the compression
ratio. Unless the CPU savings are substantial, I'd leave it alone.
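
Schematically (the struct and names here are invented, just to make the
point), what's there now is the usual walk down the collision chain:

    #include <string.h>

    /* invented types, only to illustrate the bucket walk */
    typedef struct rb_chunk_entry
    {
        struct rb_chunk_entry *next;    /* collision chain */
        int         offset;             /* chunk start in the old tuple */
        int         length;             /* chunk length */
    } rb_chunk_entry;

    static rb_chunk_entry *
    rb_find_in_bucket(rb_chunk_entry *bucket,
                      const char *old_tuple, const char *input, int inputlen)
    {
        rb_chunk_entry *entry;

        for (entry = bucket; entry != NULL; entry = entry->next)
        {
            if (entry->length <= inputlen &&
                memcmp(old_tuple + entry->offset, input, entry->length) == 0)
                return entry;           /* usable match */
        }
        return NULL;                    /* nothing in this bucket matched */
    }

Stopping after the first entry would save the rest of the loop, but every
hash collision would then turn into a missed match, which is why I expect
the ratio to suffer.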

> 3. We can move the code from pg_lzcompress.c to a new file pg_rbcompress.c.
> If we do move it, we need to either duplicate some common macros such as
> pglz_out_tag or keep them shared, possibly under a new name.

+1 for a new file.

> 4. Decide on the min and max chunk size (currently kept as 2 and 4,
> respectively). The point to consider is that bigger chunk sizes save CPU
> cycles but reduce WAL less, while smaller chunk sizes reduce WAL more but
> consume more CPU cycles.

Whoa. That seems way too small. Since PGRB_PATTERN_AFTER_BITS is 4,
the average length of a chunk is about 16 bytes. It makes little
sense to have the maximum chunk size be 25% of the expected chunk
length. I'd recommend making the maximum chunk length something like
4 * PGRB_CONST_NUM, and the minimum chunk length maybe something like
4.
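
To make the arithmetic explicit (these defines are mine, purely for
illustration; substitute PGRB_CONST_NUM or whatever the patch actually
uses):

    #define PGRB_PATTERN_AFTER_BITS 4

    /*
     * A chunk boundary fires when the low PGRB_PATTERN_AFTER_BITS bits of
     * the rolling hash are zero, i.e. on roughly 1 byte in 16, so the
     * expected chunk length is about 1 << PGRB_PATTERN_AFTER_BITS = 16.
     */
    #define PGRB_EXPECTED_CHUNK_LEN (1 << PGRB_PATTERN_AFTER_BITS)

    /* current patch: min 2, max 4; what I'm suggesting is more like */
    #define PGRB_MIN_CHUNK_SIZE     4
    #define PGRB_MAX_CHUNK_SIZE     (4 * PGRB_EXPECTED_CHUNK_LEN)  /* 64 */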

> 5. I kept a GUC variable 'wal_update_compression_ratio' for test purposes;
> we can remove it before commit.

Let's remove it now.

> 7. The docs need to be updated, and tab completion needs some work.

Tab completion can be skipped for now, but documentation is important.

> 8. We can extend ALTER TABLE to set the compression option for a table.

I don't understand what you have in mind here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
