Re: Compressing large column by moving it to a unique table

From: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Compressing large column by moving it to a unique table
Date: 2024-02-13 17:49:51
Message-ID: CANzqJaAEnfL=BpGDP=K3Yt34PyUkKzQ20avoJEN+caFefWVi2A@mail.gmail.com
Lists: pgsql-general

On original_hashed, I think I'd try moving start_timestamp into its own
index.
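
Something along these lines (an untested sketch; the index names are just
illustrative, and it's worth comparing plans with and without the existing
composite index):

```sql
-- Narrow index for the range predicate on start_timestamp.
CREATE INDEX original_hashed_start_ts_idx
    ON original_hashed (start_timestamp);

-- Separate index for the keys that joined_view joins on.
CREATE INDEX original_hashed_bucket_hash_idx
    ON original_hashed (time_bucket('1 day', start_timestamp), hash);
```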

On Tue, Feb 13, 2024 at 12:02 PM Adrian Garcia Badaracco <
adrian(at)adriangb(dot)com> wrote:

> I'm using PostgreSQL 15.5.
>
> Here's a self-contained example. I included links to public pgMustard
> query plans.
>
> Gist link:
> https://gist.github.com/adriangb/05a01cca99a438de2a286945903c44f4
>
> Also copied below for archiving:
>
> ```sql
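> -- Note: gen_random_bytes() comes from the pgcrypto extension; time_bucket() is from TimescaleDB.
> CREATE EXTENSION IF NOT EXISTS pgcrypto;
>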
> CREATE OR REPLACE FUNCTION random_bytes(length integer)
> RETURNS bytea AS $$
> DECLARE
> bytes bytea := '';
> i integer := 0;
> BEGIN
> -- generate 1024 bytes at a time using gen_random_bytes(1024)
> WHILE i < length LOOP
> bytes := bytes || gen_random_bytes(least(1024, length - i));
> i := i + 1024;
> END LOOP;
>
> RETURN bytes;
> END;
> $$ LANGUAGE plpgsql;
>
> DROP TABLE IF EXISTS original, original_hashed, dictionary_table CASCADE;
> DROP VIEW IF EXISTS joined_view;
>
> CREATE TABLE original AS (
> WITH large_random_bytes AS (
> SELECT n AS id, random_bytes(4096 + n) AS attributes
> FROM generate_series(1, 1000) n
> )
> SELECT
> -- An incrementing timestamp
> '2024-01-01 00:00:00.00+00'::timestamptz + (n || ' seconds')::interval
> AS start_timestamp,
> -- Another relatively small column; just a string derived from `n` (100 distinct values)
> 'random_string_' || (n % 100) AS event_name,
> -- The attributes column
> lrb.attributes AS attributes
> FROM generate_series(0, 1000000) n
> JOIN large_random_bytes lrb ON n % 1000 = lrb.id
> );
> CREATE INDEX ON original(start_timestamp);
>
> CREATE TABLE dictionary_table AS (
> SELECT DISTINCT time_bucket('1 day', start_timestamp) AS
> start_timestamp_range, attributes, md5(attributes) AS hash
> FROM original
> );
> CREATE INDEX ON dictionary_table (start_timestamp_range, hash);
>
> CREATE TABLE original_hashed AS (
> SELECT
> start_timestamp,
> event_name,
> md5(attributes) AS hash
> FROM original
> );
> CREATE INDEX ON original_hashed (start_timestamp, time_bucket('1 day',
> start_timestamp), hash);
>
>
> CREATE VIEW joined_view AS (
> SELECT
> original_hashed.start_timestamp,
> original_hashed.event_name,
> dictionary_table.attributes
> FROM original_hashed
> LEFT JOIN dictionary_table ON (
> time_bucket('1 day', original_hashed.start_timestamp) =
> dictionary_table.start_timestamp_range
> AND
> original_hashed.hash = dictionary_table.hash
> )
> );
>
> -- Select all data
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM joined_view
> WHERE (
> start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
> AND
> start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> -- https://app.pgmustard.com/#/explore/f8b3b7d5-c236-4326-9a67-20fa4cdff4bc
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM original
> WHERE (
> start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
> AND
> start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> -- https://app.pgmustard.com/#/explore/bcad3767-b8aa-4a37-acc7-7667266f29be
>
> -- A relatively selective aggregation query
> explain (analyze, buffers, verbose, settings, format json)
> SELECT count(*)
> FROM joined_view
> WHERE (
> get_byte(attributes, 4) < 100
> AND
> start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
> AND
> start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> -- https://app.pgmustard.com/#/explore/c5b68839-4c66-4b14-8de5-757bfc22fdb3
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT count(*)
> FROM original
> WHERE (
> get_byte(attributes, 4) < 100
> AND
> start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
> AND
> start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> -- https://app.pgmustard.com/#/explore/831d9c4c-e911-4caf-a6c3-eee5343e44d8
>
> -- Select only some columns
> explain (analyze, buffers, verbose, settings, format json)
> SELECT start_timestamp
> FROM joined_view
> WHERE (
> start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
> AND
> start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> -- https://app.pgmustard.com/#/explore/5f617bfa-7816-4f8d-9fd7-24388c08978c
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT start_timestamp
> FROM original
> WHERE (
> start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
> AND
> start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> -- https://app.pgmustard.com/#/explore/f05de3e8-ca8d-4404-ad33-69f103a3b85a
>
>
> -- A relatively selective query
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM joined_view
> WHERE (
> get_byte(attributes, 4) < 100
> AND
> start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
> AND
> start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> -- https://app.pgmustard.com/#/explore/a75b20bd-07d5-4402-a0d8-22419682307a
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM original
> WHERE (
> get_byte(attributes, 4) < 100
> AND
> start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
> AND
> start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> -- https://app.pgmustard.com/#/explore/47ef84f0-a96e-4baa-af40-2ec241cbb6e2
> ```
>
> On Tue, Feb 13, 2024 at 11:04 AM Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
> wrote:
>
>> 1. Show us the PG version, view definition, the exact query that's slow,
>> and the EXPLAIN (ANALYZE).
>> 2. Presumably there's an index on each table's *_hash column?
>>
>> On Tue, Feb 13, 2024 at 8:48 AM Adrian Garcia Badaracco <
>> adrian(at)adriangb(dot)com> wrote:
>>
>>> Thank you for the reply Ron.
>>>
>>> Yes, new_table has many fewer rows (<1% of the original row count).
>>>
>>> Thanks for making me think of normalization; I hadn't seen it that way.
>>> Although there is no theoretical relationship between the other columns in
>>> the original table and the attributes column, in practice there is a strong
>>> correlation, so what I am trying to capture here is taking advantage of that
>>> correlation without depending on it completely, since it can be broken.
>>>
>>> In any case, whatever theoretical framework is put around this solution, I am
>>> also interested in the practical aspects, in particular the case of selecting a
>>> subset of columns from the view that I know doesn't need the join but that the
>>> query planner thinks does.
>>>
>>> On Tue, Feb 13, 2024 at 3:16 AM Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
>>> wrote:
>>>
>>>> On Mon, Feb 12, 2024 at 10:12 PM Adrian Garcia Badaracco <
>>>> adrian(at)adriangb(dot)com> wrote:
>>>>
>>>>> I am using Timescale so I'll be mentioning some timestamp stuff but I
>>>>> think this is a general postgres question for the most part.
>>>>>
>>>>> I have a table with some fixed, small columns (id, timestamp, etc) and
>>>>> a large JSONB column (let's call it `attributes`). `attributes` has 1000s
>>>>> of schemas, but given a schema, there's a lot of duplication. Across all
>>>>> rows, more than 99% of the data is duplicated (as measured by
>>>>> `count(attributes)` vs `count(distinct attributes)`).
>>>>>
>>>>> I can't normalize `attributes` into real columns because it is quite
>>>>> variable (remember 1000s of schemas).
>>>>>
>>>>> My best idea is to make a table like `(day timestamptz, hash text,
>>>>> attributes jsonb)` and then in my original table replace `attributes` with
>>>>> a reference to `new_table`.
>>>>>
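>>>>> (For concreteness, a rough sketch of what I mean, with placeholder names:
>>>>>
>>>>>     CREATE TABLE new_table (
>>>>>         day        timestamptz NOT NULL,
>>>>>         hash       text        NOT NULL,
>>>>>         attributes jsonb       NOT NULL,
>>>>>         PRIMARY KEY (day, hash)
>>>>>     );
>>>>>
>>>>> The original table would then keep just the hash, with `day` recoverable
>>>>> from its own timestamp.)
>>>>>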
>>>>
>>>> Meaning that there are many fewer rows in new_table?
>>>>
>>>>
>>>>> I can then make a view that joins them `select
>>>>> original_table.timestamp, new_table.attributes from original join new_table
>>>>> on (time_bucket('1 day', timestamp) = day AND original.hash =
>>>>> new_table.hash)` or something like that. The idea of time bucketing into 1
>>>>> day is to balance write and read speed (by relying on timescale to do
>>>>> efficient time partitioning, data retention, etc.).
>>>>>
>>>>
>>>>> I recognize this is essentially creating a key-value store in postgres
>>>>> and also janky compression, so I am cautious about it.
>>>>>
>>>>
>>>> If my interpretation (that there are many fewer rows in new_table) is
>>>> correct, then you've stumbled into the Second Normal Form of database
>>>> design: https://en.wikipedia.org/wiki/Second_normal_form#Example
>>>>
>>>>
