Re: Compressing large column by moving it to a unique table

From: Adrian Garcia Badaracco <adrian(at)adriangb(dot)com>
To: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Compressing large column by moving it to a unique table
Date: 2024-02-13 13:48:08
Message-ID: CAE8z92Hgn4Qvn+XxBNTE+Gr7BPZRYRJ=1gevhv1f5XaNj4ygTA@mail.gmail.com
Lists: pgsql-general

Thank you for the reply Ron.

Yes, new_table has many fewer rows (less than 1% of the row count of the original table).

Thanks for making me think of normalization; I hadn't seen it that way.
Although there is no theoretical relationship between the other columns in
the original table and the attributes column, in practice there is a strong
correlation. What I am trying to capture here is that correlation, without
depending on it completely, because it can be broken.
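
To make that concrete, the layout I have in mind is roughly the sketch below
(table and column names mirror my earlier description but are only
placeholders, md5 is an arbitrary choice of hash, and I've left out the
Timescale hypertable and retention setup):

CREATE TABLE new_table (
    day        timestamptz NOT NULL,
    hash       text        NOT NULL,
    attributes jsonb       NOT NULL,
    PRIMARY KEY (day, hash)
);

CREATE TABLE original_table (
    id        bigint      NOT NULL,
    timestamp timestamptz NOT NULL,
    -- ...other small, fixed columns...
    -- instead of storing the jsonb inline, keep only its hash
    hash      text        NOT NULL
);

-- On write, deduplicate with something like ($1 = timestamp, $2 = attributes):
INSERT INTO new_table (day, hash, attributes)
VALUES (time_bucket('1 day', $1::timestamptz), md5($2::jsonb::text), $2::jsonb)
ON CONFLICT (day, hash) DO NOTHING;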

In any case, whatever theoretical framework is put around this solution, I
am also interested in the practical aspects, in particular the case where I
select a subset of columns from the view that I know does not need the
join, but the query planner still performs it.
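
For reference, here is roughly what the view and the problematic query look
like, using the same placeholder names as above:

CREATE VIEW original_with_attributes AS
SELECT o.id,
       o.timestamp,
       n.attributes
FROM original_table o
JOIN new_table n
  ON time_bucket('1 day', o.timestamp) = n.day
 AND o.hash = n.hash;

-- This query only touches columns from original_table, but because the view
-- uses an inner join the planner still has to confirm a matching row exists
-- in new_table, so the join gets executed anyway:
SELECT id, timestamp
FROM original_with_attributes
WHERE timestamp >= now() - interval '1 hour';

If I understand join removal correctly, it only applies to LEFT JOINs where
a unique index proves the inner side cannot affect the result, so perhaps
defining the view with a LEFT JOIN (together with the primary key on
(day, hash) above) would let the planner drop the join, but I haven't
verified that.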

On Tue, Feb 13, 2024 at 3:16 AM Ron Johnson <ronljohnsonjr(at)gmail(dot)com> wrote:

> On Mon, Feb 12, 2024 at 10:12 PM Adrian Garcia Badaracco <
> adrian(at)adriangb(dot)com> wrote:
>
>> I am using Timescale so I'll be mentioning some timestamp stuff but I
>> think this is a general postgres question for the most part.
>>
>> I have a table with some fixed, small columns (id, timestamp, etc) and a
>> large JSONB column (let's call it `attributes`). `attributes` has 1000s of
>> schemas, but given a schema, there's a lot of duplication. Across all rows,
>> more than 99% of the data is duplicated (as measured by `count(attributes)`
>> vs `count(distinct attributes)`).
>>
>> I can't normalize `attributes` into real columns because it is quite
>> variable (remember 1000s of schemas).
>>
>> My best idea is to make a table like `(day timestamptz, hash text,
>> attributes jsonb)` and then in my original table replace `attributes` with
>> a reference to `new_table`.
>>
>
> Meaning that there are many fewer rows in new_table?
>
>
>> I can then make a view that joins them `select original_table.timestamp,
>> new_table.attributes from original join new_table on (time_bucket('1 day',
>> timestamp) = day AND original.hash = new_table.hash)` or something like
>> that. The idea of time bucketing into 1 day is to balance write and read
>> speed (by relying on timescale to do efficient time partitioning, data
>> retention, etc.).
>>
>
>> I recognize this is essentially creating a key-value store in postgres
>> and also janky compression, so I am cautious about it.
>>
>
> If my interpretation (that there are many fewer rows in new_table) is
> correct, then you've stumbled into the Second Normal Form of database
> design: https://en.wikipedia.org/wiki/Second_normal_form#Example
>
>
