Quick Links

Re: [HACKERS] Custom compression methods

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [HACKERS] Custom compression methods
Date:	2017-11-21 17:47:49
Message-ID:	bfb3d34e-ba74-02cb-ca2a-43344866d9fd@2ndquadrant.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 11/21/2017 03:47 PM, Ildus Kurbangaliev wrote:
> On Mon, 20 Nov 2017 00:04:53 +0100
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> ...
>
>> 6) I'm rather confused by AttributeCompression vs.
>> ColumnCompression. I mean, attribute==column, right? Of course, one
>> is for data from parser, the other one is for internal info. But
>> can we make the naming clearer?
>
> For now I have renamed AttributeCompression to CompressionOptions,
> not sure that's a good name but at least it gives less confusion.
>

I propose to use either

CompressionMethodOptions (and CompressionMethodRoutine)

CompressionOptions (and CompressionRoutine)

>>
>> 7) The docs in general are somewhat unsatisfactory, TBH. For example
>> the ColumnCompression has no comments, unlike everything else in
>> parsenodes. Similarly for the SGML docs - I suggest to expand them to
>> resemble FDW docs
>> (https://www.postgresql.org/docs/10/static/fdwhandler.html) which
>> also follows the handler/routines pattern.
>
> I've added more comments. I think I'll add more documentation if the
> committers will approve current syntax.
>

OK. Haven't reviewed this yet.

>>
>> 8) One of the unclear things if why we even need 'drop' routing. It
>> seems that if it's defined DropAttributeCompression does something.
>> But what should it do? I suppose dropping the options should be done
>> using dependencies (just like we drop columns in this case).
>>
>> BTW why does DropAttributeCompression mess with att->attisdropped in
>> this way? That seems a bit odd.
>
> 'drop' routine could be useful. An extension could do something
> related with the attribute, like remove extra tables or something
> else. The compression options will not be removed after unlinking
> compression method from a column because there is still be stored
> compressed data in that column.
>

OK. So something like a "global" dictionary used for the column, or
something like that? Sure, seems useful and I've been thinking about
that, but I think we badly need some extension using that, even if in a
very simple way. Firstly, we need a "how to" example, secondly we need
some way to test it.

>>
>> 13) When writing the experimental extension, I was extremely
>> confused about the regular varlena headers, custom compression
>> headers, etc. In the end I stole the code from tsvector.c and
>> whacked it a bit until it worked, but I wouldn't dare to claim I
>> understand how it works.
>>
>> This needs to be documented somewhere. For example postgres.h has
>> a bunch of paragraphs about varlena headers, so perhaps it should
>> be there? I see the patch tweaks some of the constants, but does
>> not update the comment at all.
>
> This point is good, I'm not sure how this documentation should look
> like. I've just assumed that people should have deep undestanding of
> varlenas if they're going to compress them. But now it's easy to
> make mistake there. Maybe I should add some functions that help to
> construct varlena, with different headers. I like the way is how
> jsonb is constructed. It uses StringInfo and there are few helper
> functions (reserveFromBuffer, appendToBuffer and others). Maybe they
> should be not static.
>

Not sure. My main problem was not understanding how this affects the
varlena header, etc. And I had no idea where to look.

>>
>> Perhaps it would be useful to provide some additional macros
>> making access to custom-compressed varlena values easier. Or
>> perhaps the VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already
>> support that? This part is not very clear to me.
>
> These macros will work, custom compressed varlenas behave like old
> compressed varlenas.
>

OK. But then I don't understand why tsvector.c does things like

VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
VARRAWSIZE_4B_C(data) - arrsize

instead of

VARSIZE_ANY_EXHDR(data) - arrsize
VARSIZE_ANY(data) - arrsize

Seems somewhat confusing.

>>> Still it's a problem if the user used for example `SELECT
>>> <compressed_column> INTO * FROM *` because postgres will copy
>>> compressed tuples, and there will not be any dependencies
>>> between destination and the options.
>>>
>>
>> This seems like a rather fatal design flaw, though. I'd say we need
>> to force recompression of the data, in such cases. Otherwise all
>> the dependency tracking is rather pointless.
>
> Fixed this problem too. I've added recompression for datum that use
> custom compression.
>

Hmmm, it still doesn't work for me. See this:

test=# create extension pg_lz4 ;
CREATE EXTENSION
test=# create table t_lz4 (v text compressed lz4);
CREATE TABLE
test=# create table t_pglz (v text);
CREATE TABLE
test=# insert into t_lz4 select repeat(md5(1::text),300);
INSERT 0 1
test=# insert into t_pglz select * from t_lz4;
INSERT 0 1
test=# drop extension pg_lz4 cascade;
NOTICE: drop cascades to 2 other objects
DETAIL: drop cascades to compression options for lz4
drop cascades to table t_lz4 column v
DROP EXTENSION
test=# \c test
You are now connected to database "test" as user "user".
test=# insert into t_lz4 select repeat(md5(1::text),300);^C
test=# select * from t_pglz ;
ERROR: cache lookup failed for compression options 16419

That suggests no recompression happened.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Re: [HACKERS] Custom compression methods at 2017-11-21 14:47:17 from Ildus Kurbangaliev

Responses

Re: [HACKERS] Custom compression methods at 2017-11-21 20:28:55 from Ildus K
Re: [HACKERS] Custom compression methods at 2017-11-23 09:38:49 from Ildus Kurbangaliev

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2017-11-21 17:54:54	Re: View with duplicate GROUP BY entries
Previous Message	Stephen Frost	2017-11-21 17:28:03	Re: to_typemod(type_name) information function