Re: Add ZSON extension to /contrib/

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add ZSON extension to /contrib/
Date: 2021-05-25 20:19:52
Message-ID: CAEze2WheMusc73UZ5TpfiAGQ=rRwSSgr0y3j9DEVAQgQFwneRA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 25 May 2021 at 13:32, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>
> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
> <aleksander(at)timescale(dot)com> wrote:
> >
> > Hi hackers,
> >
> > Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. The extension introduces the new ZSON type, which is 100% compatible with JSONB but uses a shared dictionary of strings most frequently used in given JSONB documents for compression. These strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4) applies if the document is large enough by common PostgreSQL logic. Under certain conditions (many large documents), this saves disk space, memory and increases the overall performance. More details can be found in README on GitHub.
> >
> > The extension was accepted warmly and instantaneously I got several requests to submit it to /contrib/ so people using Amazon RDS and similar services could enjoy it too.

Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.

> > Back then I was not sure if the extension is mature enough and if it lacks any additional features required to solve the real-world problems of the users. Time showed, however, that people are happy with the extension as it is. There were several minor issues discovered, but they were fixed back in 2017. The extension never experienced any compatibility problems with the next major release of PostgreSQL.
> >
> > So my question is if the community may consider adding ZSON to /contrib/. If this is the case I will add this thread to the nearest CF and submit a corresponding patch.

I like the idea of the ZSON type, but I'm somewhat disappointed by its
current limitations:

- There is only one active shared dictionary (as a user I would want
distinct dictionaries for each use case, similar to ENUM: each ENUM
type has their own limit of 2**31 (?) values)
- There is no provided method to manually specify the dictionary (only
"zson_learn", which constructs a new dictionary)
- You cannot add to the dictionary (equiv. to ALTER TYPE enum_type ADD
VALUE), you must create a new one.

Apart from that, I noticed the following more technical points, for if
you submit it as-is as a patch:

- Each dictionary uses a lot of memory, regardless of the number of
actual stored keys. For 32-bit systems the base usage of a dictionary
without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
more than I'd want to install.
- You call gettimeofday() in both dict_get and in get_current_dict_id.
These functions can be called in short and tight loops (for small GSON
fields), in which case it would add significant overhead through the
implied syscalls.
- The compression method you've chosen seems to extract most common
strings from the JSONB table, and then use that as a pre-built
dictionary for doing some dictionary encoding on the on-disk format of
the jsonb structure. Although I fully understand that this makes the
system quite easy to reason about, it does mean that you're deTOASTing
the full GSON field, and that the stored bytestring will not be
structured / doesn't work well with current debuggers.

> If the extension is mature enough, why make it an extension in
> contrib, and not instead either enhance the existing jsonb type with
> it or make it a built-in type?

I don't think that this datatype (that supplies a basic but effective
compression algorithm over JSONB) is fit for core as-is.

I have also thought about building a similar type, but one that would
be more like ENUM: An extension on the JSONB datatype, which has some
list of common 'well-known' values that will be substituted, and to
which later more substitutable values can be added (e.g. CREATE TYPE
... AS JSONB_DICTIONARY ('"commonly_used_key"',
'"very_long_string_that_appears_often"', '[{"structure": "that",
"appears": "often"}]') or something similar). That would leave JSONB
just as a JSONB_DICTIONARY type without any substitutable values.

These specialized JSONB types could then be used as a specification
for table columns, custom types, et cetera. Some of the reasons I've
not yet built such type is me not being familiar with the jsonb- and
enum-code (which I suspect to be critical for an efficient
implementation of such type), although whilst researching I've noticed
that it is possible to use most of the JSONB infrastructure / read
older jsonb values, as there are still some JEntry type masks
available which could flag such substitutions.

With regards,

Matthias van de Meent

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-05-25 20:21:21 Re: CALL versus procedures with output-only arguments
Previous Message Tom Lane 2021-05-25 20:10:29 Re: Add ZSON extension to /contrib/