Re: Compress ReorderBuffer spill files using LZ4

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Julien Tachoires <julmon(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Compress ReorderBuffer spill files using LZ4
Date: 2024-07-16 14:01:46
Message-ID: 0861bc50-46a1-4b42-bfe6-b5e0c6050c66@enterprisedb.com
Lists: pgsql-hackers

On 7/16/24 14:52, Amit Kapila wrote:
> On Tue, Jul 16, 2024 at 12:58 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> On 7/15/24 20:50, Julien Tachoires wrote:
>>> Hi,
>>>
>>> On Fri, Jun 7, 2024 at 06:18, Julien Tachoires <julmon(at)gmail(dot)com> wrote:
>>>>
>>>> On Fri, Jun 7, 2024 at 05:59, Tomas Vondra
>>>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>>>
>>>>> On 6/6/24 12:58, Julien Tachoires wrote:
>>>>>> ...
>>>>>>
>>>>>> When compiled with LZ4 support (--with-lz4), this patch enables data
>>>>>> compression/decompression of these temporary files. Each transaction
>>>>>> change that must be written on disk (ReorderBufferDiskChange) is now
>>>>>> compressed and encapsulated in a new structure.
>>>>>>
>>>>>
>>>>> I'm a bit confused, but why tie this to having lz4? Why shouldn't this
>>>>> be supported even for pglz, or whatever algorithms we add in the future?
>>>>
>>>> That's right, reworking this patch in that sense.
>>>
>>> Please find a new version of this patch adding support for LZ4, pglz
>>> and ZSTD. It introduces the new GUC logical_decoding_spill_compression
>>> which is used to set the compression method. In order to stay aligned
>>> with the other server side GUCs related to compression methods
>>> (wal_compression, default_toast_compression), the compression level is
>>> not exposed to users.
>>>
>>
>> Sounds reasonable. I wonder if it might be useful to allow specifying
>> the compression level in those places, but that's clearly not something
>> this patch needs to do.
>>
>>> The last patch of this set is still in WIP, it adds the machinery
>>> required for setting the compression methods as a subscription option:
>>> CREATE SUBSCRIPTION ... WITH (spill_compression = ...);
>>> I think there is a major problem with this approach: the logical
>>> decoding context is tied to one replication slot, but multiple
>>> subscriptions can use the same replication slot. How should this work
>>> if 2 subscriptions want to use the same replication slot but different
>>> compression methods?
>>>
>>
>> Do we really support multiple subscriptions sharing the same slot? I
>> don't think we do, but maybe I'm missing something.
>>
>>> At this point, compression is only available for the changes spilled
>>> on disk. It is still not clear to me if the compression of data
>>> transiting through the streaming protocol should be addressed by this
>>> patch set or by another one. Thought ?
>>>
>>
>> I'd stick to only compressing the data spilled to disk. It might be
>> useful to compress the streamed data too, but why shouldn't we compress
>> the regular (non-streamed) transactions too? Yeah, it's more efficient
>> to compress larger chunks, but we can fit quite large transactions into
>> logical_decoding_work_mem without spilling.
>>
>> FWIW I'd expect that to be handled at the libpq level - there's already
>> a patch for that, but I haven't checked if it would handle this. But
>> maybe more importantly, I think compressing streamed data might need to
>> handle some sort of negotiation of the compression algorithm, which
>> seems fairly complex.
>>
>> To conclude, I'd leave this out of scope for this patch.
>>
>
> Your point sounds reasonable to me. OTOH, if we want to support
> compression for the spill case, shouldn't there be a question of how
> frequently such an option would be needed? Users currently have the
> option to stream large transactions, for parallel apply or otherwise,
> in which case no spilling is required. I feel sooner or later we will
> make such behavior (streaming=parallel) the default, and then spilling
> should happen in very few cases. Is it worth adding this new option
> and GUC if that is true?
>

I don't know, but streaming is 'off' by default, and I'm not aware of
any proposals to change that, so when you suggest we'll change it
"sooner or later", I'd probably bet on "later or never".

I haven't been following the discussions about parallel apply very
closely, but my impression from dealing with similar features in other
tools is that it's rather easy to run into issues with some workloads,
which makes me more skeptical about "streaming=parallel" by default.
But as I said, I'm out of the loop, so I may be wrong ...

As for whether the GUC is needed, I don't know. I guess we might do the
same thing we do for streaming - there is no GUC to enable it; it
defaults to 'off' and the client has to request it when opening the
replication connection. So it'd be specified at the subscription level,
more or less.
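
For illustration, the subscription-level approach could look something
like this (the spill_compression option name comes from the WIP patch
quoted earlier in the thread; the exact syntax is hypothetical):

```sql
-- Hypothetical syntax, sketching the subscription-level option from
-- the WIP patch: pick the compression method at creation time ...
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION mypub
    WITH (spill_compression = 'lz4');

-- ... and, if changing it later were allowed, something like:
ALTER SUBSCRIPTION mysub SET (spill_compression = 'zstd');
```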

But then how would we specify compression for cases that invoke the
decoding directly via pg_logical_slot_get_changes()? Through its options?
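
For reference, the SQL-level decoding functions already accept decoding
options as name/value pairs after the LSN and change-count limits, so a
spill-compression option could plausibly be passed the same way (the
option name here is purely illustrative, not part of any committed patch):

```sql
-- Existing mechanism: decoding options are passed as name/value pairs
-- after the upto_lsn and upto_nchanges arguments.
SELECT * FROM pg_logical_slot_get_changes(
    'my_slot',                   -- slot name
    NULL,                        -- upto_lsn (no limit)
    NULL,                        -- upto_nchanges (no limit)
    'spill_compression', 'lz4'   -- hypothetical option for this patch
);
```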

BTW if we specify this at subscription level, will it be possible to
change the compression method?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
