Re: [PATCH] Compression dictionaries for JSONB

From: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: Jacob Champion <jchampion(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2022-06-23 15:48:48
Message-ID: CANbhV-HpCF852WcZuU0wyh1jMU4p6XLbV6rCRkZpnpeKQ9OenQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, 2 Jun 2022 at 14:30, Aleksander Alekseev
<aleksander(at)timescale(dot)com> wrote:

> > I saw there was some previous discussion about dictionary size. It
> > looks like this approach would put all dictionaries into a shared OID
> > pool. Since I don't know what a "standard" use case is, is there any
> > risk of OID exhaustion for larger deployments with many dictionaries?
> > Or is 2**32 so comparatively large that it's not really a serious
> > concern?
>
> I agree, this is a drawback of the current implementation. To be honest,
> I simply followed the example of how ENUMs are implemented. I'm not 100% sure
> if we should be worried here (apparently, freed OIDs are reused). I'm OK with
> using a separate sequence if someone could second this. This is the first time
> I'm altering the catalog so I'm not certain what the best practices are.

The goal of this patch is great, thank you for working on this (and ZSON).

The approach chosen has a few downsides that I'm not happy with yet.

* Assigning OIDs for each dictionary entry is not a great idea. I
don't see why you would need to do that; just assign monotonically
ascending keys for each dictionary, as we do for AttrNums.
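One way to sketch a catalog layout along these lines (purely illustrative; the table and column names here are made up, not from the patch):

```sql
-- Hypothetical catalog shape: entries keyed by (dictid, entrynum), so
-- each dictionary gets its own small, monotonically ascending key space
-- (like attnum), instead of consuming a global OID per entry.
CREATE TABLE pg_dict_entry (
    dictid   oid  NOT NULL,  -- owning dictionary
    entrynum int4 NOT NULL,  -- ascending within that dictionary only
    word     text NOT NULL,
    PRIMARY KEY (dictid, entrynum)
);
```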

* There is a limit on SQL statement size, which will effectively limit
the size of dictionaries. The examples are unrealistically small, so
this isn't obvious as a limitation, but it would be in practice. It
would be better to specify a filename, which can be read in when the
DDL executes. The contents can then be put into pg_dump output in a
similar way to how COPY data for a table is, so once read in the
dictionary stays static.
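Something along these lines, with the syntax entirely hypothetical:

```sql
-- Hypothetical syntax: read the entries from a server-side file at
-- DDL time, so dictionary size is not capped by statement length.
CREATE COMPRESSION DICTIONARY mydict
    FOR jsonb
    FROM FILE '/path/to/mydict.entries';
```

pg_dump would then emit the already-read entries inline, as it does for table data, so restoring the dump would not depend on the file still existing.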

* The dictionaries are only allowed for certain datatypes. This should
not be specifically limited by this patch, i.e. user-defined types
should not be rejected.

* Dictionaries have no versioning. Any list of data items changes over
time, so how do we express that? Enums were also invented as static
lists originally, and later had to be modified to accommodate
additions and revisions, so let's think about that now, even if we
don't add all of the commands in one go. Currently we would have to
create a whole new dictionary if even one word changes. Ideally, we
want the dictionary to have a top-level name and then have multiple
versions over time. Let's agree how we are going to do these things,
so we can make sure the design and code allow for those future
enhancements.
i.e. how will we do ALTER TABLE ... UPGRADE DICTIONARY without causing
a table rewrite?
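For example (commands hypothetical, just to illustrate the direction):

```sql
-- Hypothetical versioning: a new version may only append entries, so
-- rows compressed against version 1 still decode correctly and the
-- UPGRADE DICTIONARY step need not rewrite the table.
ALTER COMPRESSION DICTIONARY mydict
    ADD VERSION 2 ('new_key', 'another_key');
ALTER TABLE t ALTER COLUMN doc UPGRADE DICTIONARY TO VERSION 2;
```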

* Does the order of entries in the dictionary allow us to express a
priority? i.e. so that more frequent entries can be given shorter
codes, allowing Huffman coding.
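For instance (again, hypothetical syntax), the DDL could treat entry order as a frequency ranking:

```sql
-- Hypothetical: entries listed most-frequent-first, so an encoder
-- could assign the shortest codes to the earliest entries.
CREATE COMPRESSION DICTIONARY mydict FOR jsonb
    ('status', 'timestamp', 'request_id');  -- descending frequency
```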

Thanks for your efforts - this is a very important patch.

--
Simon Riggs http://www.EnterpriseDB.com/
