Re: Compress ReorderBuffer spill files using LZ4

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Julien Tachoires <julmon(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Compress ReorderBuffer spill files using LZ4
Date: 2024-09-24 09:57:28
Message-ID: bd1c42f5-7167-41eb-a473-b49b77ad9363@vondra.me
Lists: pgsql-hackers

On 9/23/24 21:58, Julien Tachoires wrote:
> Hi Tomas,
>
> Le lun. 23 sept. 2024 à 18:13, Tomas Vondra <tomas(at)vondra(dot)me> a écrit :
>>
>> Hi,
>>
>> I've spent a bit more time on this, mostly running tests to get a better
>> idea of the practical benefits.
>
> Thank you for your code review and testing!
>
>> Firstly, I think there's a bug in ReorderBufferCompress() - it's legal
>> for pglz_compress() to return -1. This can happen if the data is not
>> compressible and would not fit into the output buffer. The code can't
>> just do elog(ERROR) in this case; it needs to handle that by storing
>> the raw data. The attached fixup patch makes this work for me - I'm not
>> claiming this is the best way to handle this, but it works.
>>
>> FWIW I find it strange the tests included in the patch did not trigger
>> this. That probably means the tests are not quite sufficient.
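
Just to illustrate the fallback I mean, it does roughly the following (a
simplified sketch with made-up variable names, not the literal fixup
patch):

    int32   clen;

    /* try pglz first, fall back to storing the raw bytes */
    clen = pglz_compress(rawdata, rawlen, outbuf, PGLZ_strategy_default);

    if (clen < 0)
    {
        /*
         * pglz_compress() returns -1 when the data is incompressible and
         * would not fit into the output buffer - keep the raw bytes and
         * remember the chunk is uncompressed, instead of elog(ERROR).
         */
        memcpy(outbuf, rawdata, rawlen);
        chunk_compressed = false;
        chunk_len = rawlen;
    }
    else
    {
        chunk_compressed = true;
        chunk_len = clen;
    }
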
>>
>>
>> Now, to the testing. Attached are two scripts, testing different cases:
>>
>> test-columns.sh - Table with a variable number of 'float8' columns.
>>
>> test-toast.sh - Table with a single text column.
>>
>> The script always sets up a publication/subscription on two instances,
>> generates a certain amount of data (~1GB for columns, ~3.2GB for TOAST),
>> waits for it to be replicated to the replica, and measures how much data
>> was spilled to disk with the different compression methods (off, pglz
>> and lz4). There are a couple more metrics, but those are irrelevant here.
>
> It would be interesting to run the same tests with zstd: in my early
> testing I found that zstd was able to provide a better compression
> ratio than lz4, but it seemed to use more CPU resources / be slower.
>

Oh, I completely forgot about zstd. I don't think it'd substantially
change the conclusions, though. It might compress better/worse for some
cases, but the overall behavior would remain the same.

I can't test this right now - the test machine is busy with some other
stuff. But it should not be difficult to update the test scripts I
attached and get results yourself. There are a couple of hard-coded
paths that need to be updated, ofc.

>> For the "column" test, it looks like this (this is in MB):
>>
>>    rows      columns   distribution     off    pglz    lz4
>>   ========================================================
>>    100000       1000   compressible     778      20      9
>>                        random           778     778     16
>>   --------------------------------------------------------
>>    1000000       100   compressible     916     116     62
>>                        random           916     916     67
>>
>> It's very clear that for the "compressible" data (which just copies the
>> same value into all columns), both pglz and lz4 can significantly reduce
>> the amount of data. For 1000 columns it's 780MB -> 20MB/9MB, for 100
>> columns it's a bit less efficient, but still good.
>>
>> For the "random" data (where every column gets a random value, but rows
>> are copied), it's a very different story - pglz does not help at all,
>> while lz4 still massively reduces the amount of spilled data.
>>
>> I think the explanation is very simple - for pglz, we compress each row
>> on its own, there's no concept of streaming/context. If a row is
>> compressible, it works fine, but when the row is random, pglz can't
>> compress it at all. For lz4, this does not matter, because with the
>> streaming mode it still sees that rows are just repeated, and so can
>> compress them efficiently.
>
> That's correct.
>
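
To make the "streaming context" point a bit more tangible, here's a tiny
standalone toy (my own example, not patch code) - it compresses the same
incompressible "row" twice through a single LZ4 stream, and the second
copy is nearly free because the stream still remembers the first one:

    #include <stdio.h>
    #include <stdlib.h>
    #include <lz4.h>

    int
    main(void)
    {
        /* two identical "rows" of pseudo-random (incompressible) bytes */
        static char rows[2][256];
        char        dst[LZ4_COMPRESSBOUND(256)];
        LZ4_stream_t *stream = LZ4_createStream();

        srand(42);
        for (int i = 0; i < 256; i++)
            rows[0][i] = rows[1][i] = (char) (rand() & 0xff);

        for (int i = 0; i < 2; i++)
        {
            /*
             * Earlier rows stay in memory, so the stream can use them
             * as a dictionary when compressing later rows.
             */
            int     n = LZ4_compress_fast_continue(stream, rows[i], dst,
                                                   256, (int) sizeof(dst), 1);

            printf("row %d: 256 -> %d bytes\n", i, n);
        }

        LZ4_freeStream(stream);
        return 0;
    }

pglz has no equivalent of that cross-row dictionary, which is why it
stays at ~100% on the "random" distribution.
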
>> For TOAST test, the results look like this:
>>
>>    distribution    repeats   toast     off    pglz     lz4
>>   ===============================================================
>>    compressible      10000   lz4        14       2       1
>>                              pglz       40       4       3
>>                       1000   lz4        32      16       9
>>                              pglz       54      17      10
>>   ---------------------------------------------------------------
>>    random            10000   lz4      3305    3305    3157
>>                              pglz     3305    3305    3157
>>                       1000   lz4      3166    3162    1580
>>                              pglz     3334    3326    1745
>>   ---------------------------------------------------------------
>>    random2           10000   lz4      3305    3305    3157
>>                              pglz     3305    3305    3158
>>                       1000   lz4      3160    3156    3010
>>                              pglz     3334    3326    3172
>>
>> The "repeats" value means how long the string is - it's the number of
>> "md5" hashes added to the string. The number of rows is calculated to
>> keep the total amount of data the same. The "toast" column tracks what
>> compression was used for TOAST, I was wondering if it matters.
>>
>> This time there are three data distributions - "compressible" means that
>> each TOAST value is nicely compressible, "random" means each value is
>> random (not compressible) but the rows are just copies of the same value
>> (so on the whole there's a lot of redundancy), and "random2" means each
>> row is random and unique (so not compressible at all).
>>
>> The table shows that with compressible TOAST values, compressing the
>> spill file is rather useless. The reason is that ReorderBufferCompress
>> is handling raw TOAST data, which is already compressed. Yes, it may
>> further reduce the amount of data, but it's negligible when compared to
>> the original amount of data.
>>
>> For the random cases, the spill compression is rather pointless. Yes,
>> lz4 can reduce it to 1/2 for the shorter strings, but other than that
>> it's not very useful.
>
> It's still interesting to confirm that already compressed data or
> random data cannot be significantly compressed.
>
>> For a while I was thinking this approach is flawed, because it only sees
>> and compresses changes one by one, and that seeing a batch of changes
>> would improve this (e.g. we'd see the copied rows). But I realized lz4
>> already does that (in the streaming mode at least), and yet it does not
>> help very much. Presumably that depends on how large the context is. If
>> the random string is long enough, it won't help.
>>
>> So maybe this approach is fine, and doing the compression at a lower
>> layer (for the whole file) would not really improve this. Even then
>> we'd only see a limited amount of data.
>>
>> Maybe the right answer to this is that compression does not help cases
>> where most of the replicated data is TOAST, and that it can help cases
>> with wide (and redundant) rows, or repeated rows. And that lz4 is a
>> clearly superior choice. (This also raises the question of whether we
>> want to support REORDER_BUFFER_STRAT_LZ4_REGULAR. I haven't looked into
>> this, but doesn't that behave more like pglz, i.e. no context?)
>
> I'm working on a new version of this patch set that will include the
> changes you suggested in your review. About using the LZ4 regular API,
> the goal was to use it when we cannot use the streaming API because the
> raw data is larger than the LZ4 ring buffer. But this is something I'm
> going to drop in the new version, because I'm planning to use a similar
> approach to what we do in astreamer_lz4.c: using frames, not blocks.
> The LZ4 frame API looks very similar to ZSTD's streaming API.
>
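
For the archives, the frame API shape that astreamer_lz4.c relies on
looks roughly like this (a bare-bones sketch - error handling, sizing the
output buffer via LZ4F_compressBound(), and the write_spill() helper are
all hand-waved / hypothetical):

    #include <lz4frame.h>

    /* hypothetical helper that appends bytes to the spill file */
    extern void write_spill(const char *buf, size_t len);

    static void
    spill_with_lz4_frames(const char **changes, const size_t *lens,
                          int nchanges)
    {
        LZ4F_cctx  *cctx;
        char        out[128 * 1024];
        size_t      n;

        if (LZ4F_isError(LZ4F_createCompressionContext(&cctx, LZ4F_VERSION)))
            return;

        /* frame header */
        n = LZ4F_compressBegin(cctx, out, sizeof(out), NULL);
        write_spill(out, n);

        /* one update per change, all sharing the same context */
        for (int i = 0; i < nchanges; i++)
        {
            n = LZ4F_compressUpdate(cctx, out, sizeof(out),
                                    changes[i], lens[i], NULL);
            write_spill(out, n);
        }

        /* flush buffered data and emit the frame footer */
        n = LZ4F_compressEnd(cctx, out, sizeof(out), NULL);
        write_spill(out, n);

        LZ4F_freeCompressionContext(cctx);
    }

(Frames are self-describing, so the decompression side does not need any
out-of-band parameters.)
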
>> FWIW when doing these tests, it made me realize how useful it would be
>> to track both the "raw" and "spilled" amounts, that is before/after
>> compression. It'd make calculating the compression ratio much easier.
>
> Yes, that's why I tried to "fix" the spill_bytes counter.
>

But I think the 'fixed' counter only tracks the data after the new
compression, right? I'm suggesting having two counters - one for the
"raw" data (before compression) and one for the "compressed" data
(after compression).
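
Something along these lines, with made-up field names (where exactly the
counters live is of course up for discussion):

    /* before compression - accumulate the original change size */
    rb->spillRawBytes += raw_size;

    /* after ReorderBufferCompress() - accumulate what actually hits disk */
    rb->spillBytes += compressed_size;

With both numbers exposed (e.g. via pg_stat_replication_slots), the
compression ratio is simply their quotient.
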

regards

--
Tomas Vondra
