Re: [PATCH] Compression dictionaries for JSONB

From: Nikita Malakhov <hukutoc(at)gmail(dot)com>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Zhihong Yu <zyu(at)yugabyte(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2022-07-17 18:15:03
Message-ID: CAN-LCVMjxemT+Td7PDKQSW43xmZnMu1bAOukCgkBc9_v11mPvw@mail.gmail.com
Lists: pgsql-hackers

Hi hackers!

Aleksander, I've carefully gone over the discussion and still have some
questions to ask -

1) Is there any means of measuring the overhead of dictionaries over the
vanilla implementation? IMO this is a must, because JSON is widely used
functionality. Also, as was mentioned before, the dictionary value must be
detoasted before it can be checked;
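To make point 1 concrete, here is a rough sketch (pure Python, nothing to do with the actual patch or PostgreSQL internals) of the kind of micro-benchmark I have in mind: compare document size and decode time with and without key substitution through a shared dictionary. All names and the encoding scheme here are illustrative assumptions, not the patch's API.

```python
import json
import time

# Illustrative shared dictionary: key -> short code (stands in for an OID).
DICTIONARY = {"customer_id": "0", "order_items": "1", "shipping_address": "2"}
REVERSE = {v: k for k, v in DICTIONARY.items()}

def compress(doc):
    # Replace known top-level keys with short codes; unknown keys pass through.
    return {DICTIONARY.get(k, k): v for k, v in doc.items()}

def decompress(doc):
    # Decoding requires the dictionary to be available -- this lookup is the
    # analogue of the "dictionary must be detoasted" cost.
    return {REVERSE.get(k, k): v for k, v in doc.items()}

doc = {"customer_id": 42, "order_items": [1, 2, 3], "shipping_address": "x" * 40}
plain = json.dumps(doc)
packed = json.dumps(compress(doc))

start = time.perf_counter()
for _ in range(10_000):
    decompress(json.loads(packed))
elapsed = time.perf_counter() - start

print(f"plain: {len(plain)} bytes, packed: {len(packed)} bytes")
print(f"10k decodes: {elapsed:.3f}s")
assert decompress(compress(doc)) == doc
```

A real measurement would of course have to run inside the server against the vanilla jsonb path, but even a toy like this shows that every read pays a dictionary-lookup cost that the vanilla path does not.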

2) Storing dictionaries in one table. As I wrote before, this will surely
lead to locks and waits while inserting and updating dictionaries, and
could cause serious performance issues. And vacuuming this table will lock
all tables that use dictionaries until the vacuum is complete;

3) JSON documents in production environments can be very complex and use
thousands of keys, so creating a dictionary directly in an SQL statement is
not a very good approach; this is another reason to have means for creating
dictionaries as separate tables and/or passing them in as files or so;
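As an illustration of point 3, the key list could be derived from a corpus of sample documents instead of being spelled out by hand. The sketch below is illustrative only, and the `CREATE COMPRESSION DICTIONARY` syntax it emits is purely hypothetical - the point is just that the entry list is generated, not typed:

```python
import json
from collections import Counter

def collect_keys(doc, counter):
    # Recursively count every object key in a parsed JSON document.
    if isinstance(doc, dict):
        for k, v in doc.items():
            counter[k] += 1
            collect_keys(v, counter)
    elif isinstance(doc, list):
        for v in doc:
            collect_keys(v, counter)

def build_dictionary_sql(docs, name="my_dict", limit=65536):
    counter = Counter()
    for raw in docs:
        collect_keys(json.loads(raw), counter)
    keys = [k for k, _ in counter.most_common(limit)]
    entries = ", ".join("'%s'" % k.replace("'", "''") for k in keys)
    # Hypothetical syntax, for illustration only.
    return f"CREATE COMPRESSION DICTIONARY {name} ({entries});"

corpus = ['{"a": {"b": 1}}', '{"a": 2, "c": [{"b": 3}]}']
print(build_dictionary_sql(corpus))
```

With thousands of keys, generating the statement (or loading the key list from a file on the server side) seems much more practical than writing it out in SQL.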

4) The suggested mechanics, if put on top of TOAST, cannot benefit from
knowledge of the internal JSON structure, which is seen as an important
drawback despite the extensive research done on working with JSON Schema
(storing, validating, etc.); it also cannot recognize, and help to
compress, duplicated parts of a JSON document;

5) A small test issue - what happens if a dictionary-compressed JSON
document has a key that is equal to an OID used in the dictionary for some
other key?
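To show why point 5 matters: if codes live in the same key space as ordinary keys, a literal key that happens to equal a code decodes to the wrong key. One cheap fix is an escape prefix for colliding literal keys. Everything below is an illustrative sketch of the problem, not the patch's actual encoding:

```python
ESCAPE = "\x00"  # marker meaning "literal key, do not look up in dictionary"
DICTIONARY = {"customer_id": "16385"}  # key -> OID-like code (illustrative)
REVERSE = {v: k for k, v in DICTIONARY.items()}

def compress(doc):
    out = {}
    for k, v in doc.items():
        if k in DICTIONARY:
            out[DICTIONARY[k]] = v
        elif k in REVERSE or k.startswith(ESCAPE):
            # Escape keys that could be mistaken for codes on decode.
            out[ESCAPE + k] = v
        else:
            out[k] = v
    return out

def decompress(doc):
    out = {}
    for k, v in doc.items():
        if k.startswith(ESCAPE):
            out[k[len(ESCAPE):]] = v
        else:
            out[REVERSE.get(k, k)] = v
    return out

# A document whose literal key equals the code assigned to "customer_id":
doc = {"16385": "collides with the code", "customer_id": 1}
assert decompress(compress(doc)) == doc
```

Without the escape branch, the literal "16385" would round-trip back as "customer_id", silently corrupting the document - which is why I think this case deserves an explicit test.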

In Pluggable TOAST we suggest, as an improvement, that compression be put
inside the Toaster as an option, so that the Toaster gets maximum benefit
from knowledge of the data's internal structure (and can use JSON Schema in
the future). Compression dictionaries look like a very valuable addition to
a specialized Toaster for the JSON datatype, but for now I have to agree
that this feature, in its current state, competes with Pluggable TOAST.

Thank you!

Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/

On Tue, Jul 12, 2022 at 3:15 PM Aleksander Alekseev <
aleksander(at)timescale(dot)com> wrote:

> Hi Nikita,
>
> > Aleksander, please point me in the right direction if it was mentioned
> before, I have a few questions:
>
> Thanks for your feedback. These are good questions indeed.
>
> > 1) It is not clear for me, how do you see the life cycle of such a
> dictionary? If it is meant to keep growing without
> > cleaning up/rebuilding it could affect performance in an undesirable
> way, along with keeping unused data without
> > any means to get rid of them.
> > 2) From (1) follows another question - I haven't seen any means for
> getting rid of unused keys (or any other means
> > for dictionary cleanup). How could it be done?
>
> Good point. This was not a problem for ZSON since the dictionary size
> was limited to 2**16 entries, the dictionary was immutable, and the
> dictionaries had versions. For compression dictionaries we removed the
> 2**16 entries limit and also decided to get rid of versions. The idea
> was that you can simply continue adding new entries, but no one
> thought about the fact that this will consume the memory required to
> decompress the document indefinitely.
>
> Maybe we should return to the idea of limited dictionary size and
> versions. Objections?
>
> > 4) If one dictionary is used by several tables - I see future issues in
> concurrent dictionary updates. This will for sure
> > affect performance and can cause unpredictable behavior for queries.
>
> You are right. Another reason to return to the idea of dictionary versions.
>
> > Also, I agree with Simon Riggs, using OIDs from the general pool for
> dictionary entries is a bad idea.
>
> Yep, we agreed to stop using OIDs for this, however this was not
> changed in the patch at this point. Please don't hesitate joining the
> effort if you want to. I wouldn't mind taking a short break from this
> patch.
>
> > 3) Is the possible scenario legal - by some means a dictionary does not
> contain some keys for entries? What happens then?
>
> No, we should either forbid removing dictionary entries or check that
> all the existing documents are not using the entries being removed.
>
> > If you have any questions on Pluggable TOAST don't hesitate to ask me
> and on JSONB Toaster you can ask Nikita Glukhov.
>
> Will do! Thanks for working on this and I'm looking forward to the
> next version of the patch for the next round of review.
>
> --
> Best regards,
> Aleksander Alekseev
>
