From: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
To: Adrian Garcia Badaracco <adrian(at)adriangb(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Compressing large column by moving it to a unique table
Date: 2024-02-13 14:04:19
Message-ID: CANzqJaBcxbbQQTS82eYpDV_RHEKUy0CDXSitQOAVdxJAQ3oq9Q@mail.gmail.com
Lists: pgsql-general
1. Show us the PG version, the view definition, the exact query that's slow,
and the EXPLAIN (ANALYZE) output.
2. Presumably there's an index on each table's *_hash column?
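For example, something along these lines (an untested sketch; the table and
column names are assumptions based on the rest of this thread):

```sql
-- The view's join looks up rows by hash, so both sides need an index on it.
CREATE INDEX ON original_table (hash);

-- On the deduplicated side, a unique index on (day, hash) both speeds the
-- join and enforces one attributes row per day/hash pair.
CREATE UNIQUE INDEX ON new_table (day, hash);
```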
On Tue, Feb 13, 2024 at 8:48 AM Adrian Garcia Badaracco <adrian(at)adriangb(dot)com>
wrote:
> Thank you for the reply Ron.
>
> Yes, new_table has many fewer rows (<1% of the original).
>
> Thanks for making me think of normalization; I hadn’t seen it that way.
> Although there is no theoretical relationship between the other columns in
> the original table and the attributes column, in practice there is a strong
> correlation, so I guess what I am trying to capture here is taking advantage
> of that correlation, while not depending on it completely, because it can be
> broken.
>
> In any case, whatever theoretical framework is put around this solution, I
> am also interested in the practical aspects, in particular the case of
> selecting a subset of columns from the view that I know doesn’t need the
> join, even though the query planner thinks it does.
>
> On Tue, Feb 13, 2024 at 3:16 AM Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
> wrote:
>
>> On Mon, Feb 12, 2024 at 10:12 PM Adrian Garcia Badaracco <
>> adrian(at)adriangb(dot)com> wrote:
>>
>>> I am using Timescale so I'll be mentioning some timestamp stuff but I
>>> think this is a general postgres question for the most part.
>>>
>>> I have a table with some fixed, small columns (id, timestamp, etc) and a
>>> large JSONB column (let's call it `attributes`). `attributes` has 1000s of
>>> schemas, but given a schema, there's a lot of duplication. Across all rows,
>>> more than 99% of the data is duplicated (as measured by `count(attributes)`
>>> vs `count(distinct attributes)`).
>>>
>>> I can't normalize `attributes` into real columns because it is quite
>>> variable (remember 1000s of schemas).
>>>
>>> My best idea is to make a table like `(day timestamptz, hash text,
>>> attributes jsonb)` and then in my original table replace `attributes` with
>>> a reference to `new_table`.
>>>
>>
>> Meaning that there are many fewer rows in new_table?
>>
>>
>>> I can then make a view that joins them: `select original_table.timestamp,
>>> new_table.attributes from original_table join new_table on
>>> (time_bucket('1 day', original_table.timestamp) = new_table.day AND
>>> original_table.hash = new_table.hash)` or something like that. The idea of
>>> bucketing into 1-day intervals is to balance write and read speed (by
>>> relying on timescale to do efficient time partitioning, data retention,
>>> etc.).
>>>
>>
>>> I recognize this is essentially creating a key-value store in postgres
>>> and also janky compression, so I am cautious about it.
>>>
>>
>> If my interpretation (that there are many fewer rows in new_table) is
>> correct, then you've stumbled into the Second Normal Form of database
>> design: https://en.wikipedia.org/wiki/Second_normal_form#Example
>>
>>
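
For reference, the scheme described above might look roughly like this. It is
an untested sketch, not a recommendation: the table, view, and column names
are assumptions pieced together from the thread, and `time_bucket` is a
TimescaleDB function, not core Postgres.

```sql
-- Deduplicated attribute store: one row per (day, hash) pair, holding the
-- jsonb payload once instead of on every row of the original table.
CREATE TABLE new_table (
    day        timestamptz NOT NULL,
    hash       text        NOT NULL,
    attributes jsonb       NOT NULL,
    PRIMARY KEY (day, hash)
);

-- original_table keeps its small fixed columns plus the hash reference,
-- e.g. original_table(id, "timestamp", hash, ...).

-- View reassembling the original shape via the (day, hash) join.
CREATE VIEW original_with_attributes AS
SELECT o."timestamp", n.attributes
FROM original_table o
JOIN new_table n
  ON time_bucket('1 day', o."timestamp") = n.day
 AND o.hash = n.hash;
```

Bucketing by day bounds how long a given hash value must stay unique, so old
buckets can be dropped by Timescale's retention machinery without the
original table holding dangling references into much older data.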